Method for characterizing DNA sequences

ABSTRACT

A method for characterizing cDNA, which comprises: (a) cutting a sample comprising a population of one or more cDNAs or isolated fragments thereof, each having a strand complementary to the 3′ poly-A terminus of an mRNA and bearing a tail, with a first sampling endonuclease at a first sampling site of known displacement from a reference site proximal to the tail to generate from each cDNA or isolated fragment thereof a first and second sub-fragment, each comprising a sticky end sequence of predetermined length and unknown sequence, the first sub-fragment bearing the tail; (b) sorting either the first or second sub-fragments into sub-populations according to their sticky end sequence and recording the sticky end sequence of each sub-population as the first sticky end; (c) cutting the sub-fragments of each sub-population with a second sampling endonuclease, which is the same as or different from the first sampling endonuclease, at a second sampling site of known displacement from the first sampling site to generate from each sub-fragment a further sub-fragment comprising a second sticky end sequence of predetermined length and unknown sequence; and (d) determining each second sticky end sequence; wherein the aggregate length of the first and second sticky end sequences of each sub-fragment is from 6 to 10; and wherein the sequences and relative positions of the reference site and first and second sticky ends are utilized to characterize said cDNA or cDNAs.

FIELD OF THE INVENTION

The present invention relates to a method for characterising DNA, especially cDNA, so that the DNA may be identified, for example, from a population of DNAs. The invention also relates to a method for assaying the DNA.

BACKGROUND OF THE INVENTION

Analysis of complex nucleic acid populations is a common problem in many areas of molecular biology, nowhere more so than in the analysis of patterns of gene expression. Various methods have been developed to allow simultaneous analysis of entire mRNA populations, or their corresponding cDNA populations, to enable us to begin to understand patterns of gene expression in vivo.

The method of “subtractive cloning” (Lee et al, Proc. Nat. Acad. Sci. U.S.A. 88, 2825-2829) allows identification of mRNAs, or rather, their corresponding cDNAs, that are differentially expressed in two related cell types. One can selectively eliminate cDNAs common to two related cell types by hybridising cDNAs from a library derived from one cell type to a large excess of mRNA from a related, but distinct cell type. mRNAs in the second cell type complementary to cDNAs from the first type will form double-stranded hybrids. Various enzymes exist which degrade such ds-hybrids, allowing these to be eliminated and thus enriching the remaining population in cDNAs unique to the first cell type. This method allows highly specific comparative information about differences in gene expression between related cell types to be derived and has had moderate success in isolating rare cDNAs.

The method of “differential display” (Liang and Pardee, Science 257, 967-971, 1992) sorts mRNAs using PCR primers to amplify selectively specific subsets of an mRNA population. An mRNA population is primed with a general poly-T primer to amplify one strand and a specific primer, of perhaps 10 nucleotides or so, to amplify the reverse strand with greater specificity. In this way only mRNAs bearing the second primer sequence are amplified; the longer the second primer, the smaller the proportion of the total cDNA population that is amplified for any given sequence of that length used. The resultant amplified sub-population can then be cloned for screening or sequencing, or the fragments can simply be separated on a sequencing gel. Low copy number mRNAs are less likely to get lost in this sort of scheme in comparison with subtractive cloning, for example, and it is probably more reproducible. Whilst this method is more general than subtractive cloning, time-consuming analysis is required.

The method of “molecular indexing” (PCT/GB93/01452) uses populations of adaptor molecules to hybridise to the ambiguous sticky-ends generated by cleavage of a nucleic acid with a type IIs restriction endonuclease to categorise the cleavage fragments. Using specifically engineered adaptors one can specifically immobilise or amplify or clone specific subsets of fragments in a manner similar to differential display but achieving a greater degree of control. Again, time-consuming analysis is required.

The method of Kato (Nucleic Acids Research 12, 3685-3690, 1995) exemplifies the above molecular indexing approach and effects cDNA population analysis by sorting terminal cDNA fragments into sub-populations followed by selective amplification of specific subsets of cDNA fragments. Sorting is effected by using type IIs restriction endonucleases and adaptors. The adaptors also carry primer sites which, in conjunction with general poly-T primers, allow selective amplification of terminal cDNA fragments as in differential display. It is possibly more precise than differential display in that it effects greater sorting: only about 100 cDNAs will be present in a given subset, and sorting can be related to specific sequence features rather than using primers chosen by trial and error.

The method of “serial analysis of gene expression” (SAGE, Science 270, 484-487, 1995) allows identification of mRNAs, or rather, their corresponding cDNAs, that are expressed in a given cell type. It gives quantitative information about the levels of those cDNAs as well. The process involves isolating a “tag” from every cDNA in a population using adaptors and type IIs restriction endonucleases. A tag is a sample of a cDNA sequence of a fixed number of nucleotides sufficient to identify uniquely that cDNA in the population. Tags are then ligated together and sequenced. The method gives quantitative data on gene expression and will readily identify novel cDNAs. However, the method is extremely time-consuming in view of the large amount of sequencing required.

All of the above methods are relatively laborious and rely upon sequencing by traditional gel methods. Moreover, the methods require amplification by PCR, which is prone to produce artefacts.

Methods involving hybridisation grids, chips and arrays are advantageous in that they avoid gel methods for sequencing and are quantitative. They can be performed entirely in solution, and thus are readily automatable. These methods come in two forms. The first involves immobilisation of target nucleic acids to an array of oligonucleotides complementary to the terminal sequences of the target nucleic acid. Immobilisation is followed by partial sequencing of those fragments by a single base method, e.g. using type IIs restriction endonucleases and adaptors. This particular approach is advocated by Brenner in PCT/US95/12678.

The second form involves arrays of oligonucleotides of N bp length. The array carries all 4^(N) possible oligonucleotides at specific points on the grid. Nucleic acids are hybridised as single strands to the array. Detection of hybridisation is achieved by fluorescently labelling each nucleic acid and determining from where on the grid the fluorescence arises, which determines the oligonucleotide to which the nucleic acid has bound. The fluorescent labels also give quantitative information about how much nucleic acid has hybridised to a given oligonucleotide. This information and knowledge of the relative quantities of individual nucleic acids should be sufficient to reconstruct the sequences and quantities of the hybridising population. This approach is advocated by Lehrach in numerous papers, and Nucleic Acids Research 22, 3423 contains a recent discussion. A disadvantage of this approach is that the construction of large arrays of oligonucleotides is extremely technically demanding and expensive.

SUMMARY OF THE INVENTION

The present invention provides a method for characterising cDNA, which comprises:

(a) cutting a sample comprising a population of one or more cDNAs or isolated fragments thereof, each having a strand complementary to the 3′ poly-A terminus of an mRNA and bearing a tail, with a first sampling endonuclease at a first sampling site of known displacement from a reference site proximal to the tail to generate from each cDNA or isolated fragment thereof a first and second sub-fragment, each comprising a sticky end sequence of predetermined length and unknown sequence, the first sub-fragment bearing the tail;

(b) sorting either the first or second sub-fragments into sub-populations according to their sticky end sequence and recording the sticky end sequence of each sub-population as the first sticky end;

(c) cutting the sub-fragments in each sub-population with a second sampling endonuclease, which is the same as or different from the first sampling endonuclease, at a second sampling site of known displacement from the first sampling site to generate from each sub-fragment a further sub-fragment comprising a second sticky end sequence of predetermined length and unknown sequence; and

(d) determining each second sticky end sequence;

wherein the aggregate length of the first and second sticky end sequences of each sub-fragment is from 6 to 10; and wherein the sequences and relative positions of the reference site and first and second sticky ends characterise the or each cDNA. Optionally, the sample cut with the first sampling endonuclease comprises isolated fragments of the cDNAs produced by cutting a sample comprising a population of one or more cDNAs with a restriction endonuclease and isolating fragments whose restriction site is at the reference site.

This invention involves a process that allows a cDNA population, generated by various means, to be sorted into sub-populations or subsets. The process also allows the identification of individual molecules within a subset and it allows the quantity of those individual molecules to be determined. More specifically, this invention is capable of analysing a population of cDNAs derived from a specific cell type to generate a profile of gene expression for that cell. This profile would reveal which cDNAs are present and how much of each is present. From this it should then be possible to determine initial quantities of mRNA present in the cell, possibly by calibrating cDNA quantities against the expression of a known house-keeping gene whose in vivo levels could be determined directly.

It is not necessary to sequence an entire cDNA to identify uniquely its presence; only a short ‘signature’ of a few base pairs should be sufficient to identify uniquely all cDNAs, given, for example, a total cDNA population of about 80 000 in the human genome. Given also that in the next few years the entire human genome will have been sequenced, it should be possible to use such signatures derived by this process to acquire the entire sequence of the original cDNAs from a sequence database. With the incomplete database that already exists, signatures that return no sequence from the database will probably be novel and this process will readily allow them to be isolated for complete sequencing. If a given signature returns more than one sequence then this process can readily resolve the returned sequence by acquiring further sequence data specifically from the sequence of interest. This is a feature of this process that is of great advantage over other methods such as SAGE.

Velculescu et al, Science 270, 484-487 (1995), have tested human sequences in release version 87 of the GenBank sequence database with every possible 9 bp sequence starting from a particular reference point, their ‘anchoring enzyme’ cutting site. Their results indicated that with a 9 bp sequence 95.5% of tags corresponded to a unique transcript or highly conserved (>95% sequence identity over at least 250 bp) transcript family. Increasing the number of bp in the tags used to test the database to 11 bp resulted in only a 6% decrease in the number of tags returning more than 1 sequence from the database.

Statistically, the odds that 2 sequences with the same signature are identical sequences can be calculated using Bayes' Theorem:

$$P(\text{Identical} \mid \text{Same Signature}) = \frac{P(\text{Same Signature} \mid \text{Identical}) \times P(\text{Identical})}{P(\text{Same Signature})} \qquad (1)$$

Where “|” means “given that” and, similarly:

$$P(\text{Not Identical} \mid \text{Same Signature}) = \frac{P(\text{Same Signature} \mid \text{Not Identical}) \times P(\text{Not Identical})}{P(\text{Same Signature})} \qquad (2)$$

(1) divided by (2) gives:

$$\text{Posterior Odds Identical} = \frac{P(\text{Same Signature} \mid \text{Identical}) \times \text{Prior Odds Identical}}{P(\text{Same Signature} \mid \text{Not Identical})} = 4^{N} \times \text{Prior Odds Identical}$$

Where N is the number of bases in the signature. 4^(N) clearly will rise very quickly with N. The Prior Odds Identical are the known odds of two random sequences being identical. In terms of a non-redundant sequence database this is actually zero. Thus we have 4^(N) signatures available to search a human sequence database. This analysis assumes equiprobable and spatially uncorrelated bases, which is clearly not true for real sequences. If there is spatial correlation of bases etc., much larger signatures might be necessary but, as the analysis of Velculescu et al suggests, this is not the case: longer signatures do not give greater resolution of sequences; 9 bp is sufficient, as the human genome probably contains of the order of 80 000 sequences of which a large number are closely related, as defined above. An 8 bp signature gives 65536 distinct signatures. For experimental purposes, i.e. for analysing tissue samples, this will be enough to resolve the estimated 15000 distinct cDNAs that are expected in the average cell, but one might expect that a number of signatures might return more than 1 sequence. These can fortunately be readily resolved by further analysis, as discussed below.
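The arithmetic above can be checked with a short sketch. Under the stated assumption of equiprobable, uncorrelated bases, P(Same Signature | Identical) = 1 and P(Same Signature | Not Identical) = 4^(-N), which is where the factor 4^(N) comes from. The function names below are illustrative only and do not come from the specification; the estimate of how many cDNAs share a signature additionally assumes that signatures are assigned uniformly at random, which real sequences will not satisfy exactly.

```python
def signature_space(n_bases):
    """Number of distinct signatures of n_bases nucleotides (4^N)."""
    return 4 ** n_bases


def posterior_odds_identical(n_bases, prior_odds):
    """Posterior odds = 4^N x prior odds, assuming equiprobable,
    spatially uncorrelated bases (an idealisation)."""
    return signature_space(n_bases) * prior_odds


def expected_shared_fraction(n_cdnas, n_bases):
    """Expected fraction of cDNAs whose signature is shared with at least
    one other cDNA, assuming signatures are uniformly random (hypothetical
    model, used only to illustrate the order of magnitude)."""
    space = signature_space(n_bases)
    return 1.0 - (1.0 - 1.0 / space) ** (n_cdnas - 1)


if __name__ == "__main__":
    print(signature_space(8))                  # size of the 8 bp signature space
    print(expected_shared_fraction(15000, 8))  # fraction needing further resolution
```

With 15000 cDNAs and an 8 bp signature, this idealised model suggests that a minority of signatures would return more than one sequence, consistent with the expectation stated above that some signatures need further analysis.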

Thus, at least for human cDNAs, the aggregate length of the first and second sticky-end sequences of each sub-fragment is preferably 8, and conveniently, the length of each sticky end is 4.

cDNAs from species other than humans can also be readily analysed by the process of the present invention. The aggregate length of the first and second sticky-end sequences can be tailored to the size of the cDNA population expected for a particular species with similar optimization procedures as discussed below. The size of the signature may vary depending on the size of the genome to be analysed. More general nucleic acid populations may also be analysed, such as restriction fragments generated from plasmids or small bacterial or viral genomes. Other similarly generated populations could similarly be analysed.

When the restriction endonuclease is used to produce fragments from the cDNAs, it is preferred that the first sampling endonuclease binds to a first recognition site and cuts at the first sampling site at a predetermined displacement from the restriction site of the restriction endonuclease. Preferably, the first recognition site is provided in a first adaptor oligonucleotide which is hybridised or ligated to the restriction site of the isolated fragments. In this way, the fragments need contain no recognition site for the first sampling endonuclease. Preferably, a low stringency restriction endonuclease is used to generate the cDNA fragments, such as one which recognises a 4 base pair binding site (e.g. NlaIII which cuts at CATG leaving a 4 bp sticky-end). If too large a binding site needs to be recognised, the probability that no recognisable binding site is present in a specific cDNA is too great.

As an alternative to using the restriction endonuclease, the first sampling endonuclease may bind to the reference site and cut at the first sampling site at a predetermined displacement from the reference site. In either arrangement, it is necessary that a reference site be used because this site contributes to the information required to establish each “signature”.

The importance of this step should be noted with regard to analysing a population of cDNAs. Cleaving the immobilised cDNAs with the ‘reference enzyme’ (i.e. the restriction endonuclease or first sampling endonuclease) will leave fragments that are known to be terminated by the reference site that is most 3′ on the cDNA. With the purpose of searching a database in mind, this greatly reduces the search by starting from the restriction site nearest the 3′ terminus (see FIG. 8). It also gives additional spatial information regarding the positions of the ‘signature’, in that there is a defined spacing between an 8 bp signature, say of two quadrats, and the reference site. There is a lower probability of an 8 bp signature occurring with a given spatial relationship to a defined restriction site than for a given 8 bp sequence to appear at a random position in the whole cDNA or in the genome as a whole. In this way the determinative power of an 8 bp signature is increased so that it is sufficient to identify uniquely all or at least the vast majority of cDNAs.

It is also important to ensure no sampling endonuclease recognition sites are present in the cDNA fragments prior to addition of adaptors bearing the sampling endonuclease recognition site. To avoid this problem the cDNA can be pretreated with the sampling endonuclease before use of the restriction endonuclease or, for that matter, the sampling endonuclease and restriction endonuclease can be the same enzyme. This will generate fragments with ambiguous sticky-ends. If a different ‘reference enzyme’ is to be used, the majority of these sticky-ends will be removed by the subsequent cleavage with the ‘reference enzyme’, as this would be chosen to cut more frequently. Those that remain will be accounted for in the sorting process. This means that there will effectively be two ‘reference enzymes’ and this must be taken into account in the subsequent database searching by searching for both possible reference sequences. This might return more sequences for each region of 8 bp of variable sequence; thus use of two reference enzymes would preferably be avoided.

As a preferred alternative, to ensure the sampling endonuclease binds only to occurrences of its recognition sequence within an adaptor rather than to occurrences which may occur in the cDNA, one can synthesise the cDNA with 5-methyl cytosine and use adaptors synthesised with ordinary cytosine nucleotides. As long as one uses a sampling endonuclease that is methylation sensitive, the sampling endonuclease will only bind to occurrences of its recognition sequence in an adaptor.

Preferably, the second sampling endonuclease binds to a second recognition site and cuts at the second sampling site at a predetermined displacement from the first sampling site. In this way, information (in the form of the first and second sticky-end sequences) is derived from first and second sampling sites and, additionally, their displacement from one another and from the reference site is known. Preferably, the first and second sampling endonucleases each comprise a type IIs endonuclease, which may be the same as or different from one another. The second recognition site may be provided in a second adaptor oligonucleotide which is hybridised or ligated to the first sticky-end.

The process of the present invention acquires minimal sequence data so that it is not reliant on excessive sequencing. It does not require traditional gel methods to acquire minimal sequence information. Since the entire process takes place in solution, the steps involved could be performed by a liquid-handling robot; hence this process is highly automatable. Sequence data in an automated system can then be acquired in parallel for the entire cDNA population of a cell.

The process avoids excessive sequencing by using a sampling procedure, as above, to generate signatures for each cDNA in a population. The preferred form of these signatures would be:

5′-CATGNNNNNXXXXNNNNNYYYYNNN . . . NNNAAAAAAAA-3′

Reference . . . space . . . Sample 1 . . . space . . . Sample 2 . . . unknown space . . . poly-A tail

This sort of signature would preferably be acquired from an immobilised cDNA population, but clearly a signature could be acquired from anywhere in a sequence; it must, however, be from the same defined reference point in each sequence to be compared if minimal sequence data is to be usable. The cDNA population is preferably immobilised using the poly-A tail, in bold at the 3′ terminus, using, for example, a solid phase matrix. The first 4 bp of the signature, in bold, is known, as this corresponds to the reference site which could be from a low stringency ordinary type II restriction endonuclease. This may be used to fragment the cDNA population initially to generate a reference point from which samples are taken to generate unique signature information for every cDNA in a cell. The next 4 bp in bold are acquired at a known number of bp, which is the same for every cDNA in a population, from the ‘reference site’ by the ‘first sampling endonuclease’, which preferably is a type IIs restriction endonuclease. These 4 bp are unknown, but obviously only 256 possibilities exist. These may be determined by pulling out subsets corresponding to each of the possible 4 bp sequences using beads with oligonucleotides complementary to one of the possible sequences, as described below for the sorting procedure. The next 4 bp in bold are again generated at a known distance, which is the same for every cDNA in a population, from the first sampled sequence, possibly by the same type IIs ‘sampling enzyme’, and may be determined by the ‘adaptor cycle’, as described below. Thus for every cDNA, we have a known restriction site that is the last one of its kind on the cDNA before the poly-A tail, separated by a known distance from a sample of the cDNA sequence of known length. This sample in turn is separated from the next sample by a known number of bp and the second sample length is again defined.
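As an in silico illustration of this signature layout, the following sketch reads a signature from a known cDNA sequence, for example when comparing against database entries. It is a minimal sketch under stated assumptions: the function name, the 5 bp spacings (taken from the example signature shown above) and the 4 bp sample length are illustrative choices, not values fixed by the process, and the poly-A tail is simply treated as part of the input string.

```python
def extract_signature(cdna, reference="CATG", gap1=5, gap2=5, sample_len=4):
    """Return (reference, sample 1, sample 2) read from the reference site
    nearest the 3' end of `cdna`, or None if no complete signature fits.

    gap1: bases between the reference site and sample 1 (assumed value).
    gap2: bases between sample 1 and sample 2 (assumed value).
    """
    # The reference site used is the one closest to the 3' terminus.
    ref_pos = cdna.rfind(reference)
    if ref_pos < 0:
        return None
    s1_start = ref_pos + len(reference) + gap1
    s2_start = s1_start + sample_len + gap2
    if s2_start + sample_len > len(cdna):
        return None  # fragment too short to carry both samples
    sample1 = cdna[s1_start:s1_start + sample_len]
    sample2 = cdna[s2_start:s2_start + sample_len]
    return reference, sample1, sample2


if __name__ == "__main__":
    example = "GGCATGTT" + "CATG" + "AAAAA" + "ACGT" + "CCCCC" + "TTGA" + "GGG" + "A" * 8
    print(extract_signature(example))
```

Using `rfind` reflects the point made above that the reference site is the last one of its kind before the poly-A tail; an upstream CATG in the same sequence is ignored.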

The sample lengths can be up to 5 bp as determined by the enzymes presently available. The distances between the samples or between the first sample and the reference site can be up to 20 bases, but the actual distance does not matter except that it must be known.

The restriction endonuclease cutting sequence can be of any length as long as it is a sequence that is recognised by a type IIs restriction endonuclease, but practically speaking it must be such as to ensure that the enzyme actually cuts every cDNA and that the terminal fragments of the cDNAs that remain are of a reasonable length to sample subsequently with the sampling endonuclease.

Clearly, if a nucleic acid population is subjected to cleavage with a restriction endonuclease, there will be sticky-ends at both termini of the nucleic acid fragments, which in most cases would be different at each end. This would cause problems to this sorting process.

For the purposes of this invention, use of mRNA avoids this problem, since the 3′ terminus of the UTR of a mRNA is characterised by the presence of a poly-A tail. This can be used to immobilise one terminus of each mRNA present to a matrix with a complementary poly-T oligonucleotide attached to its surface. This ensures only one terminus is exposed to subsequent cleavage by the type IIs restriction enzyme after cDNA synthesis. After restriction, all non-immobilised fragments, i.e. those without a poly-A tail, are to be washed away, leaving only the immobilised terminal fragments. The purpose of this process is to derive sufficient information to identify uniquely each cDNA molecule present in a population. As long as the terminal fragments are of the order of about 10 to 20 nucleotides from the termination codon, this should be sufficient to obtain a unique signature for every cDNA, given a maximum total population of about 100 000 cDNAs in the human genome.

Type IIs restriction endonucleases, the ‘sampling endonucleases’, have the property that they recognise and bind to a specific sequence within a target DNA molecule, but they cut at a defined distance away from that sequence, generating single-stranded sticky-ends of known length but unknown sequence at the cleavage termini of the restriction products.

For example, the enzyme fok1 generates an ambiguous (i.e. unknown) sticky-end of 4 bp, 9 bp downstream of its recognition sequence. This ambiguous sticky-end could thus be one of 256 possible 4 bp oligonucleotides (see FIG. 1). Numerous other type IIs restriction endonucleases exist and could be used for this process as discussed below in the section on restriction endonucleases. Their binding site can be provided by the adaptors used as shown in FIG. 2, for example.
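The cutting behaviour described here can be modelled on a single strand as follows. This is a simplified sketch, not a description of the enzymology: it considers only the strand carrying the recognition sequence as written, takes the first occurrence of the site, and treats the two cut positions (9 and 13 for fok1) as offsets counted from the end of the recognition sequence. The function name and its parameters are hypothetical.

```python
def sticky_end(top_strand, site="GGATG", cut_a=9, cut_b=13):
    """Return the single-stranded overhang left by a type IIs enzyme that
    binds `site` and cuts the two strands cut_a and cut_b bases downstream
    of it (simplified single-strand model; defaults follow fok1).

    Returns None if the site is absent or the sequence is too short.
    """
    pos = top_strand.find(site)
    if pos < 0:
        return None
    site_end = pos + len(site)
    near, far = sorted((cut_a, cut_b))
    if site_end + far > len(top_strand):
        return None
    # The bases lying between the two staggered cuts form the overhang.
    return top_strand[site_end + near:site_end + far]


if __name__ == "__main__":
    fragment = "GGATG" + "A" * 9 + "CTGA" + "TTTT"
    print(sticky_end(fragment))  # the 4 bp ambiguous end sampled by the enzyme
```

Because the overhang is 4 bases of unknown sequence, it is one of the 4^4 = 256 possibilities mentioned above.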

Numerous type IIs restriction endonucleases exist and could be used as sampling enzymes for this process. Table 1 below gives a list of examples but is by no means comprehensive. A literature review of restriction endonucleases can be found in Roberts, R., J. Nucl. Acids Res. 18, 2351-2365, 1988. New enzymes are discovered at an increasing rate and more up to date listings are recorded in specialist databases such as REBase, which is readily accessible on the internet using software packages such as Netscape or Mosaic and is found at the World Wide Web address: http://www.neb.com/rebase/. REBase lists all restriction enzymes as they are discovered and is updated regularly; moreover, it lists recognition sequences and isoschizomers of each enzyme and manufacturers and suppliers. The spacing of recognition sites for a given enzyme within an adaptor can be tailored according to requirements and the enzyme's cutting behaviour. (See FIG. 2 above).

TABLE 1 Some typical type IIs restriction endonucleases

Enzyme Name    Recognition sequence    Cutting site
Fok1           GGATG                   9/13
BstFs1         GGATG                   2/0
SfaNI          GCATC                   5/9
HgaI           GACGC                   5/10
BbvI           GCAGC                   8/12

The requirement of the process is the generation of ambiguous sticky-ends at the termini of the nucleic acids being analysed. This could also be achieved by controlled use of 5′ to 3′ exonucleases. Clearly any method that achieves the creation of such sticky-ends will suffice for the process.

Similarly, the low stringency restriction endonuclease is necessary only to cleave each cDNA once, preferably leaving sticky-ends. Any means, however, of cleaving the immobilised nucleic acid would suffice for this invention. Site specific chemical cleavage has been reported in Chu, B. C. F. and Orgel, L. E., Proc. Natl. Acad. Sci. U.S.A., 1985, 963-967. Use of a non-specific nuclease to generate blunt ended fragments might also be used. Preferably, though, a type II restriction endonuclease would be used, chosen for accuracy of recognition of its site, maximal processivity and cheap and ready availability.

The first or second sub-fragments may be sorted in step (b) by any sorting method suitable to generate sub-populations according to their sticky-end sequence. One method comprises dividing the sub-fragments into an array of samples, each sample in a separate container; contacting the array of samples with an array of solid phase affinity matrices, each solid phase affinity matrix bearing a unique base sequence of the same predetermined length as the first sticky end, so that each sample is contacted with one of the possible base sequences and the array of samples is contacted with all possible base sequences of that predetermined length for hybridisation to occur only between each unique base sequence and first sticky end complementary with one another; and washing unhybridised material from the containers.

Thus, a heterogeneous population of nucleic acids derived by cleavage with the sampling endonuclease, like fok1, can be sorted into sub-populations by ‘pulling out’ subsets of nucleic acids characterised by a particular sequence at the sticky-ends. One can isolate the sub-populations using, for example, beads coated with an oligonucleotide carrying a sticky-end complementary to that on the target subset of nucleic acids. The beads can then be isolated, washed and released into a clean container, which for the purposes of this process would preferably be a well in an array. Clearly any means of isolating cDNAs is usable in this invention, which includes immobilising complementary oligonucleotides onto any insoluble, solid phase support. This might for example include affinity chromatography, inert beads and centrifugation or any similar means, but beads, magnetic or not, are preferred. Any appropriate container could be used but an array of wells would be preferred for use with liquid handling robots in an automated embodiment of the process.

In an alternative embodiment, cDNA fragments generated by the first cleavage with a type IIs restriction endonuclease to generate ambiguous sticky-ends can be sorted into sub-populations according to their sticky-ends using a hybridisation array. Typically, this method comprises (i) binding the sub-fragments to a hybridisation array comprising an array of oligonucleotide sets, each set bearing a unique base sequence of the same predetermined length as the first sticky end and identifiable by location in the array, all possible base sequences of that predetermined length being present in the array, so that each sub-population bearing its unique first sticky end is hybridised at an identifiable location in the array; and (ii) determining the location to identify the first sticky end sequence.

For a 4 bp ambiguous sticky-end, every possible combination of bases can be accounted for with an array of 256 oligonucleotide sets.
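The sorting step can be pictured with a small simulation, given here as a sketch rather than a model of hybridisation chemistry. It enumerates all 256 possible 4-base capture sequences and assigns each fragment to the location whose capture sequence is the reverse complement of the fragment's sticky end. Perfect, mismatch-free hybridisation is assumed, and all names are illustrative.

```python
from itertools import product

# Translation table for Watson-Crick complementation.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def all_capture_sequences(length=4):
    """Every possible base sequence of the given length (4^length of them)."""
    return ["".join(bases) for bases in product("ACGT", repeat=length)]


def sort_by_sticky_end(fragments, length=4):
    """Assign fragments to array locations keyed by capture sequence.

    fragments: iterable of (sticky_end, fragment_id) pairs.
    Assumes idealised hybridisation: a fragment binds only the capture
    sequence exactly complementary to its sticky end.
    """
    wells = {capture: [] for capture in all_capture_sequences(length)}
    for end, fragment_id in fragments:
        wells[reverse_complement(end)].append(fragment_id)
    return wells


if __name__ == "__main__":
    wells = sort_by_sticky_end([("AAAA", "cDNA-1"), ("ACGA", "cDNA-2"), ("AAAA", "cDNA-3")])
    print(len(wells))     # number of array locations
    print(wells["TTTT"])  # fragments captured by the TTTT oligonucleotide set
```

Reading back the location of each occupied set identifies the first sticky end of the sub-population held there.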

Ideally, the fragments to be used would be the fragments free in solution generated by the first sampling endonuclease cleavage. These fragments would carry an adaptor at the 5′ terminus. To allow for a second cleavage with a sampling endonuclease, the oligonucleotides on the array would have to carry a recognition site for the second sampling endonuclease.

The step of determining each second sticky-end sequence may be accomplished in a number of ways. By the use of the second sampling endonuclease, two further sub-fragments are generated.

Generally, immobilized fragments and fragments free in solution will have been generated. Either set of fragments, both bearing ambiguous sticky-ends, could be analysed to determine additional sequence information.

Where a hybridisation array has been used to sort sub-fragments, the sub-fragments cut in step (c) are preferably those bound to the hybridisation array so that the further sub-fragments generated thereby remain bound to the hybridisation array. In this embodiment, the step (d) of determining each second sticky-end sequence comprises contacting the further sub-fragments under hybridisation conditions with an array of adaptor oligonucleotides, each adaptor oligonucleotide bearing a label and a unique base sequence of the same predetermined length as the second sticky end, the array containing all possible base sequences of that predetermined length, removing any unhybridised adaptor oligonucleotide, and determining the location of any hybridised adaptor oligonucleotide by detection of the label.

This embodiment is particularly advantageous because such arrays of oligonucleotides can be constructed in very small chips of perhaps 2 mm² or less. This enables minimal quantities of reagents to be used and so high concentrations can be used to increase the hybridisation rate of adaptors, which is the rate limiting step of this process.

As an alternative, where sub-populations of sub-fragments have been sorted, the step of determining each second sticky-end sequence comprises isolating the further sub-fragments from step (c) and contacting the further sub-fragments with an array of adaptor oligonucleotides in a cycle, each adaptor oligonucleotide bearing a label and a unique base sequence of the same predetermined length as the second sticky end, the array containing all possible base sequences of that predetermined length; wherein the cycle comprises sequentially contacting each adaptor oligonucleotide of the array with each sub-population of isolated sub-fragments under hybridisation conditions, removing any unhybridised adaptor oligonucleotide and determining the presence of any hybridised adaptor oligonucleotide by detection of the label, then repeating the cycle, until all of the adaptors in the array have been tested.

This particular part of the process may be termed “the adaptor cycle”.

This part of the process is essentially sequencing by hybridisation and can be understood first by explaining it for the case of a single nucleic acid. Consider a single nucleic acid, immobilised at one terminus to a fixed insoluble matrix, that has been cleaved at the free terminus, as above, with Fok I, thus generating a 4 bp ambiguous sticky-end.

To determine the sequence of that sticky-end one can probe the immobilised nucleic acid with an adaptor molecule. This would be an oligonucleotide carrying a sticky-end with one known sequence of 4 bp out of the possible 256. The adaptor would additionally carry a fluorescent probe (and possibly a binding site for the sampling endonuclease). If the adaptor is complementary to the ambiguous end of the target nucleic acid, it will hybridise and it will then be possible to ligate the adaptor to the target. The matrix can then be washed to remove any unbound adaptor. To determine whether the adaptor has hybridised to the immobilised target, one need only measure the fluorescence of the matrix. This will also reveal how much of the adaptor has hybridised, and hence the amount of immobilised cDNA. Other means of detecting hybridisation may be used in this invention. Radio-labeled adaptors could be used as an alternative to a fluorescent probe, as could dyes, stable isotopes, tagging oligonucleotides, enzymes, carbohydrates and biotin, amongst others.

The construction of adaptor oligonucleotides is well known, and details and reviews are available in numerous texts, including: Gait, M. J., editor, ‘Oligonucleotide Synthesis: A Practical Approach’, IRL Press, Oxford, 1990; Eckstein, editor, ‘Oligonucleotides and Analogues: A Practical Approach’, IRL Press, Oxford, 1991; Kricka, editor, ‘Nonisotopic DNA Probe Techniques’, Academic Press, San Diego, 1992; Haugland, ‘Handbook of Fluorescent Probes and Research Chemicals’, Molecular Probes, Inc., Eugene, 1992; Keller and Manak, ‘DNA Probes, 2nd Edition’, Stockton Press, New York, 1993; and Kessler, editor, ‘Nonradioactive Labeling and Detection of Biomolecules’, Springer-Verlag, Berlin, 1992.

Conditions for using such adaptors are also well known. Details on the effects of hybridisation conditions for nucleic acid probes are available, for example, in any one of the following texts: Wetmur, Critical Reviews in Biochemistry and Molecular Biology, 26, 227-259, 1991; Sambrook et al, ‘Molecular Cloning: A Laboratory Manual, 2nd Edition’, Cold Spring Harbor Laboratory, New York, 1989; and Hames, B. D., Higgins, S. J., ‘Nucleic Acid Hybridisation: A Practical Approach’, IRL Press, Oxford, 1988.

Likewise, ligation of adaptors is well known and chemical methods of ligation are discussed, for example, in Ferris et al, Nucleosides and Nucleotides 8, 407-414, 1989; and Shabarova et al, Nucleic Acids Research 19, 4247-4251, 1991.

Preferably, enzymatic ligation would be used, and preferred ligases are T4 DNA ligase, T7 DNA ligase, E. coli DNA ligase, Taq ligase, Pfu ligase, and Tth ligase. Details of such ligases are found, for example, in: Lehman, Science 186, 790-797, 1974; and Engler et al, ‘DNA Ligases’, pp 3-30 in Boyer, editor, ‘The Enzymes, Vol 15B’, Academic Press, New York, 1982. Protocols for the use of such ligases can be found in: Sambrook et al, cited above; Barany, PCR Methods and Applications, 1: 5-16, 1991; and Marsh et al, Strategies 5, 73-76, 1992.

If the adaptor is not complementary to the ambiguous sticky-end of the target nucleic acid, then a second probe can be tried and the above process repeated until all 256 possible probes have been tested.

Clearly one of these will have to be complementary to the ambiguous end. Once it has been found and ligated, the terminus of the target nucleic acid will also carry the adaptor's binding site for the sampling endonuclease. This allows cleavage of the target nucleic acid, exposing further bases for analysis, and the above process can be repeated for the next 4 bp of the target. This iterative process can be repeated until the entire target nucleic acid has been sequenced.
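The single-molecule adaptor cycle described above can be sketched as a simple search over the 256 possible 4-base adaptor ends. This is only an illustrative model: hybridisation and ligation are reduced to exact reverse complementarity, and the function names and the order in which adaptors are tried are assumptions, not part of the method as described.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def all_adaptor_ends(length=4):
    """All 4**length possible sticky-end sequences, in a fixed order."""
    return ["".join(p) for p in product("ACGT", repeat=length)]


def adaptor_cycle(target_sticky_end):
    """Try each adaptor in turn; return (matching adaptor end, adaptors tried).

    A retained fluorescent signal is modelled (as an assumption) by the
    adaptor end being the exact reverse complement of the target end.
    """
    wanted = reverse_complement(target_sticky_end)
    for tried, adaptor_end in enumerate(all_adaptor_ends(len(target_sticky_end)), start=1):
        if adaptor_end == wanted:
            return adaptor_end, tried
    raise ValueError("no complementary adaptor found")


adaptor, tried = adaptor_cycle("GTCC")
print(adaptor, tried)
```

In this model the cycle always terminates within 256 adaptors, which mirrors the statement that one of the probes must be complementary to the ambiguous end.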

In a further aspect, the present invention provides a method for identifying cDNA in a sample. The method comprises characterising cDNA as described above so as to obtain the sequences and relative positions of the reference site and first and second sticky-ends, and comparing those sequences and relative positions with the sequences and relative positions of the reference site and first and second sticky-ends of known cDNAs, such as those available from DNA databases, in order to identify the or each cDNA in the sample. This method can be used to identify a single cDNA or a population of cDNAs.
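The comparison with known cDNAs can be pictured as an index from signature to sequence name. The sketch below is a simplified illustration under stated assumptions: the reference site is taken to be GATC, the first and second sticky-ends are read at fixed offsets of 5 and 0 bases after that site (the layout used in the Example), and the "database" holds only the short sequence excerpts quoted in the Example. The function names are hypothetical.

```python
from collections import defaultdict

REFERENCE_SITE = "GATC"  # assumed reference site (Sau3A1 site of the Example)


def signature(cdna, first_offset=5, second_offset=0, length=4):
    """Return (first sticky-end, second sticky-end) read after the reference site.

    Offsets are counted from the base following the reference site; the first
    occurrence of the site is used, which is a simplification.
    """
    cdna = cdna.upper()
    pos = cdna.find(REFERENCE_SITE)
    if pos < 0:
        return None
    start = pos + len(REFERENCE_SITE)
    first = cdna[start + first_offset:start + first_offset + length]
    second = cdna[start + second_offset:start + second_offset + length]
    if len(first) < length or len(second) < length:
        return None
    return first, second


def build_index(database):
    """Map each signature to the names of all sequences that produce it."""
    index = defaultdict(list)
    for name, seq in database.items():
        sig = signature(seq)
        if sig is not None:
            index[sig].append(name)
    return index


# Excerpts taken from the Example (sequence following the Sau3A1 cut).
KNOWN = {
    "AE14": "GATCTTCCAGGACCACC",
    "AE16": "GATCTGAGACTCCAGGAATAT",
    "AE19": "GATCATCTGCCTGGCAG",
}
INDEX = build_index(KNOWN)


def identify(sig):
    """Return the names of known sequences matching an observed signature."""
    return INDEX.get(sig, [])


print(identify(("GGAC", "TTCC")))
```

A signature that maps to more than one name would correspond to the case, discussed later, in which a signature returns more than one sequence and needs further resolution.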

In a further aspect, the present invention provides a method for assaying for one or more specific cDNAs in a sample. This assay method comprises performing a method of characterising cDNA as described above, wherein the reference site is predetermined, each first sticky-end sequence in sorting step (b) is a predetermined first sticky-end sequence and each second sticky-end sequence in step (d) is determined by assay of a predetermined second sticky-end sequence. In this assay method, the relative positions of the reference site and predetermined first and second sticky-ends characterise the or each specific cDNA. The assay method can be used to detect the presence of a single specific cDNA or a population of specific cDNAs. The reference site and first and second sticky-end sequences are preferably predetermined by selecting corresponding sequences from one or more known target cDNAs, such as those available from a DNA database.

The invention will now be described in further detail by way of example only, with reference to the following Example and the accompanying drawings:

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 shows the restriction behaviour of Fok I;

FIG. 2 shows the cutting behaviour of adaptor oligonucleotides;

FIG. 3 shows the structure of a preferred adaptor oligonucleotide;

FIG. 4 shows the structure of a self-removing adaptor oligonucleotide;

FIG. 5 shows a set of multiple dyes on oligonucleotide adaptors;

FIGS. 6a-c show a schematic representation of a process according to one embodiment of the invention;

FIGS. 7a-c show a schematic representation of a process according to another embodiment of the invention; and

FIG. 8 shows an algorithm to search a sequence database to isolate human cDNAs corresponding to signatures.

The process of the invention can be applied to a heterogeneous population of immobilised nucleic acids, allowing them to be analysed in parallel. To be successful when applied to a population of nucleic acids, this method relies on the fact that statistically 1 out of 256 molecules within the total population will carry each of the possible 4 bp sticky-ends after cleavage with Fok I. The average human cell is estimated to express about 15,000 distinct types of mRNA. If a cDNA population of this size is sorted into 256 sub-populations by the sorting procedure described above, each will contain on average about 60 different cDNAs. If these are then cleaved with Fok I, one would expect that almost all will have different ambiguous sticky-ends (there is about a 1 in 1000 chance of there being 2 distinct cDNAs having the same initial 4 bp sticky end), so for most purposes one can assume that a hybridisation signal corresponds to a single cDNA type. Thus sequential addition of fluorescently labeled adaptors will allow the terminal 4 bp of a mixed population of cDNAs to be determined, resulting in 8 bp of signature in total for each cDNA in the population.
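The figures quoted above follow from simple counting, which the short calculation below reproduces. The variable names are illustrative, and the figure of 15,000 distinct mRNAs is the estimate given in the text.

```python
distinct_mrnas = 15_000          # estimate quoted in the text
sticky_end_length = 4            # bases per sticky-end

wells = 4 ** sticky_end_length                   # possible first sticky-ends
mean_cdnas_per_well = distinct_mrnas / wells     # average sub-population size
signature_space = 4 ** (2 * sticky_end_length)   # possible 8 bp signatures

print(wells, round(mean_cdnas_per_well, 1), signature_space)
```

The average of roughly 59 cDNAs per well is what the text rounds to 60.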

Fluorescence detectors can usually detect the fluorescence of just a single molecule as long as the signal reaches the photomultiplier, so the design of immobilisation matrices is crucial to ensure the fidelity of the process. This sensitivity means, however, that the hybridisation signal is quantitative when using fluorescently labeled adaptors: it reveals how many adaptor molecules have hybridised to the immobilised fragments. This number is directly proportional to the number of copies of each cDNA that is present. Thus each hybridisation signal will also reveal the relative proportion of each cDNA within the population. This can be related back to the in vivo levels of the mRNA by determining directly the quantity of a specific mRNA in vivo, preferably one with a high copy number, such as that of a housekeeping gene. The ratio of this quantity to the relative quantity of that mRNA as determined by the adaptor cycle will be the conversion factor used to calculate the original in vivo quantities of each mRNA.
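The calibration described above amounts to a single scaling factor. The sketch below illustrates it with invented numbers; the gene names, signal values and copy number are hypothetical and serve only to show the arithmetic.

```python
def conversion_factor(reference_in_vivo_quantity, reference_signal):
    """Ratio of the directly measured in vivo quantity of the reference mRNA
    to its relative quantity (signal) from the adaptor cycle."""
    return reference_in_vivo_quantity / reference_signal


def to_in_vivo(signals, reference_name, reference_in_vivo_quantity):
    """Scale every adaptor-cycle signal by the conversion factor."""
    factor = conversion_factor(reference_in_vivo_quantity, signals[reference_name])
    return {name: signal * factor for name, signal in signals.items()}


# Hypothetical signals (arbitrary fluorescence units) and a hypothetical
# reference of 400 copies per cell for the housekeeping transcript.
signals = {"housekeeping": 1000.0, "gene_x": 250.0}
print(to_in_vivo(signals, "housekeeping", 400.0))
```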

Detection of fluorescent signals can be performed using optical equipment that is readily available. Fluorescent labels usually have optimum frequencies for excitation and then fluoresce at specific wavelengths in returning from an excited state to a ground state. Excitation can be performed with lasers at specific frequencies and fluorescence detected using collection lenses, beam splitters and signal distribution optics. These direct fluorescent signals to photomultiplier systems, which convert optical signals to electronic signals that can be interpreted using appropriate electronics systems. See, for example, pp 26-28 of PCT/US95/12678. A discussion of solid phase supports can also be found on pp 12-14 of that document.

Having acquired 4 bp of sequence information in the process of sorting cDNAs into subsets, one need only perform the adaptor cycle once to acquire an 8 bp signature for each cDNA in a well. Using a liquid handling robot, this can be performed simultaneously for all 256 wells generated by the sorting process.

The positioning of the recognition site for Fok I in the adaptor will determine whether the next 4 bp exposed are the next 4 bp in the sequence. Alternatively, they may overlap partially with the last four base pairs, thus giving partially redundant information, or they may be further downstream, missing out a few bases and thus only sampling the sequence of the immobilised target nucleic acid. This is illustrated in FIG. 2. The cutting behaviour of adaptors, with respect to which nucleotides are left single-stranded in the target nucleic acid, is determined by the spacing between the Fok I recognition site and the target DNA. Sequential bases can be exposed with adaptor 1, while bases are sampled at intervals by adaptor 2. With adaptor 3, redundant information is acquired. Adaptor nucleic acid is shown and Fok I binding sites are underlined.

Whatever spacing is used, the spatial information relating the 4 bp oligonucleotides is retained. For the purposes of this invention a sampling approach is sufficient, thus allowing the smallest and most economical adaptor to be constructed. FIG. 3 shows a preferred minimal adaptor for use in acquiring signatures in the present invention. The recognition sequence of Fok I is shown.

A preferred embodiment of the process is shown in FIGS. 6a to c. In step 1, mRNAs are immobilized by hybridisation to biotinylated poly-T. This allows capture of the population onto avidinated glass beads after reverse transcription of the mRNA. In step 2, the poly-A carrying cDNAs are treated with the restriction endonuclease and loose fragments are washed away. In step 3, an adaptor oligonucleotide is added which bears a sticky-end complementary to the restriction endonuclease sticky-end. The adaptor carries a recognition site for the first sampling endonuclease and, optionally, a label. In step 4, the immobilized cDNA fragments are treated with the first sampling endonuclease so as to generate for the first time an immobilized fragment with a sticky-end and a fragment free in solution (steps 2 and 3 are optional only if the immobilized sticky-end fragment is to be analysed). In step 5 of this embodiment, the loose sub-fragments in solution are isolated from the immobilized sub-fragments and sub-divided into 256 wells. Each well contains an insoluble matrix, preferably beads, derivatised with oligonucleotides carrying sticky-ends complementary to one of the 256 possible sticky-ends. Thus, the beads in each well in step 6 will immobilize one of the 256 possible sticky-ends from the sample, which are then ligated to the beads. Fragments that are not immobilized can then be washed away, thus generating a sorted population of 256 sub-populations of cDNA fragments.

In step 8, the second sampling endonuclease is added to each well containing the sub-population of immobilized fragments generated from step 7. The second sampling enzyme in this example is BspM1, whose recognition site is provided in the same sampling adaptor oligonucleotide attached to the bead.

The ambiguous sticky-end YYYY generated in step 8 is present on both the further sub-fragment in solution and the further sub-fragment immobilized to the bead. The further sub-fragments are therefore readily separable by washing the immobilized matrix to remove cleaved adaptors and reagent, as shown in step 9.

At this stage in the process, one option for analysis is to enter the “adaptor cycle” with the immobilized fragments. This is discussed in further detail below. If the fragments to be analysed by the adaptor cycle are free in solution, then they must be immobilized first. As a second option, either fragment can be analysed further by a number of other methods. If the fragment is labelled with a fluorescent dye, one can determine the terminal sequence using a hybridisation chip. If the label is an immobilization effector, then cleavage fragments can be isolated, immobilized and analysed by a single base method.

Referring to step 10 in FIG. 6c, the further sub-fragment attached to the bead enters the adaptor cycle, as discussed in further detail below.

In a second preferred embodiment of the invention, as shown in FIGS. 7a to c, steps 1 to 4 are as described above. At step 5 it is the immobilized fragments that are sorted into sub-sets for further analysis. The cDNAs on beads are divided into 256 samples, the cDNAs are released from the beads and the beads are recovered. At step 6 in FIG. 7b, to each well is added a magnetic bead bearing an oligonucleotide complementary to one of the 256 possible 4 bp ambiguous sticky-ends generated by the first sampling endonuclease. After hybridisation, the beads are recovered and washed, and each bead type, binding a sub-population of the fragments bearing a unique first sticky-end, is released into one of 256 clean wells. The wells contain a matrix to immobilize cDNAs permanently, such as avidinated glass beads.

In step 8, the hybridisation conditions are altered to release the beads, which are then recovered. As a result of step 8, each well now contains beads with known first sticky-ends to which a known adaptor can be added carrying a recognition site for the same sampling endonuclease (in this case, Fok I). Step 9 shows the step of adding the adaptor oligonucleotide, which is hybridised to the immobilized fragment. In step 10, the sampling endonuclease is added, whereby a loose sub-fragment and an immobilized sub-fragment, each bearing the second sticky-end, are generated. Either of these fragments can be further analysed, as discussed in relation to the first embodiment.

Use of the adaptor cycle is further described in FIG. 6c for the first embodiment of the invention and in FIG. 7c for the second embodiment. Referring to FIG. 6c, the beads carrying the second sticky-end are analysed using the adaptor cycle at step 10. An adaptor oligonucleotide bearing a fluorescent label is added to the beads. The adaptor contains a unique sticky-end which will be complementary to one of the 256 possible four base second sticky-ends that might be present on the immobilized sub-fragment. The sequence of the sticky-end of each adaptor oligonucleotide is predetermined. Unhybridised adaptors are washed away and the fluorescence is measured. The cycle is repeated until all of the adaptors have been tested.

If a signature returns more than one sequence from a database, one can attempt to resolve these sequences by using the known signature information. If resolving sequences is required, the adaptor cycle could be altered using adaptors of the form shown in FIG. 4. This figure shows a self-removing adaptor in which the addition of a sampling endonuclease results in cleavage of only the nucleotides that the adaptor adds to the target nucleic acid, thereby re-exposing the bases whose sequence is being determined. The recognition sequence shown in the adaptor in the figure is that of BspM1.

After determining the second quadrat of a signature using adaptors of the form above, it would be possible to remove them; then, if a particular signature had returned more than one sequence, a second adaptor specific to the terminal 4 bp could be added to acquire a further sample. Using an appropriate sampling enzyme this could be 2, 3 or 4 further bp of sequence, depending on requirement, but clearly fewer bases of additional sequence require fewer adaptors to determine the sequence of the resulting sticky-ends.

Once sequence information has been derived for a cDNA, perhaps by previous profiling, the present invention can be used to isolate a specific cDNA fragment using the same approach but focusing on one specific cDNA. Thus if the first 4 bp of signature are known, then one can select for that subset of all cDNAs using the corresponding magnetic bead that would have been used in the sorting process. The sequence of the next 4 bp derived from the adaptor cycle could then be used to construct an adaptor carrying the appropriate sticky-end and a specific PCR primer. The desired cDNA could then be amplified using a general poly-T primer and the specific primer on the adaptor. The amplified fragment would provide a unique probe that could be used to identify the complete cDNA or mRNA on a Southern or Northern blot.

In order to speed up the adaptor cycle, adaptors can be added in groups so long as individual subsets of adaptors are each labeled with a different fluorescent marker to permit hybridisation of each adaptor subset to be distinguished. This sort of modification will still allow quantitative information to be derived, but 4 different photomultipliers would be required to detect each label. FIG. 5 shows the use of multiple dyes on adaptors, which would allow groups of adaptors to be tested simultaneously.
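One way to picture the grouping is sketched below: the 256 adaptors are split into groups of four, each member of a group carrying a different dye, so that 64 additions cover the whole set. The dye names and the particular partition are hypothetical; the text only requires that adaptors tested together carry distinguishable labels.

```python
from itertools import product

DYES = ("dye1", "dye2", "dye3", "dye4")  # hypothetical labels, one per detector


def make_groups(dyes=DYES, length=4):
    """Partition all 4**length adaptor ends into groups with one adaptor per dye."""
    ends = ["".join(p) for p in product("ACGT", repeat=length)]
    return [dict(zip(dyes, ends[i:i + len(dyes)]))
            for i in range(0, len(ends), len(dyes))]


def decode(groups, group_index, dye):
    """Recover which adaptor end produced a signal from its group and dye."""
    return groups[group_index][dye]


groups = make_groups()
print(len(groups), groups[0])
```

With four dyes the number of hybridisation steps drops from 256 to 64, at the cost of needing a separate detection channel for each dye.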

One potential problem with the ‘Adaptor Cycle’ is ensuring that hybridisation of probes is accurate. There are major differences in stability between short oligonucleotide duplexes containing all Watson-Crick base pairs. For example, duplexes comprising only adenine and thymine are unstable relative to duplexes of guanine and cytosine only. These differences in stability can present problems when trying to hybridise mixtures of short oligonucleotides (e.g. 4-mers) to complementary target DNA. Low temperatures are needed to hybridise A-T rich sequences, but at these temperatures G-C rich sequences will hybridise to sequences that are not fully complementary. This means that some mismatches may happen and specificity can be lost for the G-C rich sequences. At higher temperatures G-C rich sequences will hybridise specifically but A-T rich sequences will not hybridise.

In order to normalise these effects, modifications can be made to the Watson-Crick bases. The following are examples but they are not limiting:

The adenine analogue 2,6-diaminopurine forms three hydrogen bonds to thymine rather than two and therefore forms more stable base pairs.

The thymine analogue 5-propynyl dU forms more stable base pairs with adenine.

The guanine analogue hypoxanthine forms two hydrogen bonds with cytosine rather than three and therefore forms less stable base pairs.

These and other possible modifications should make it possible to compress the temperature range at which random mixtures of short nucleotides can hybridise specifically to their complementary sequences.

It may also be possible to design smaller sets of adaptors with base analogs that bind to multiple bases, such as deoxyinosine, 2-aminopurine or the like (Kong Thoo Lin et al, Nucleic Acids Research 20, 5149-5152). Such a set might have adaptors of the form below:

GGATG          GGATG
CCTACAANG      CCTACANTG

N would represent all 4 bases at that position. Thus each adaptor above represents a set of four adaptors. The two sets shown above would have only one common member. Each set would have one common member with four other sets. There are only 64 sets with N at the 3rd position in the sticky-end; similarly there are only 64 sets with N at the 2nd position. Hence to identify every base uniquely, 128 sets of adaptors could be used rather than the complete 256. To resolve the overlapping sets one might need to have some initial information about the number of cDNAs in each of the 256 samples. Sorted sets of cDNAs of the kind to be used in this process would have on average 60 cDNAs, which could be resolved on a sequencing gel. If the cDNAs were radiolabeled or fluorescently labeled, the quantities of each cDNA could be determined. This might be valuable in order to save time, as each adaptor set added in the adaptor cycle may take up to an hour to hybridise fully. Thus any means of increasing the speed of the process might be useful and worth the additional labour of producing the gels.
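The counting argument for the degenerate adaptor sets can be checked by enumeration. In the sketch below, a pattern such as AANG stands for the four sticky-ends obtained by replacing N with each base; the function names are illustrative only.

```python
from itertools import product

BASES = "ACGT"


def expand(pattern):
    """All concrete sticky-ends matched by a pattern in which N is any base."""
    choices = [BASES if c == "N" else c for c in pattern]
    return {"".join(p) for p in product(*choices)}


def degenerate_sets(n_index, length=4):
    """All patterns of the given length with N at position n_index (0-based)."""
    patterns = []
    for rest in product(BASES, repeat=length - 1):
        bases = list(rest)
        bases.insert(n_index, "N")
        patterns.append("".join(bases))
    return patterns


third = degenerate_sets(2)    # N at the 3rd position, e.g. AANG
second = degenerate_sets(1)   # N at the 2nd position, e.g. ANTG

print(expand("AANG") & expand("ANTG"))
print(len(third), len(second))
print(sum(1 for s in second if expand("AANG") & expand(s)))
```

Each 4-base sticky-end belongs to exactly one set of each kind, and the two sets intersect only in that sticky-end, which is why positive signals from one set of each kind can in principle identify it.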

Clearly also a larger tissue sample might have to be used. Construction of the redundant sets above would be made cheaper if bases with ‘wobble’ could be used to reduce degeneracy.

Various single base methods of analysing nucleic acids have been proposed and may be usable with the present invention. Most of these avoid gel techniques of DNA sequencing and potentially could be appropriate for analysing, in parallel, the subpopulations generated by the sorting process described above. Single base methods are disclosed, for example, in U.S. Pat. No. 5,302,509; WO 91/66678; J. D. Harding and R. A. Keller, Trends in Biotechnology 10, 55-58, 1992; WO 93/21340; Canard et al, Gene 148, 1-6, 1994; Metzker et al, Nucleic Acids Research 22, 4259-4267, 1994; PCT/US95/03678; and PCT/GB95/00109.

Use of hybridisation chips, grids or arrays would also be practical with this invention. An array of oligonucleotides would need to contain only 256 oligonucleotides, corresponding to the 256 possible 4 bp sticky ends that would be generated by the second treatment of the cDNA fragments with the ‘sampling enzyme’. If the fragments to be analysed are labeled with a fluorescent dye, then the sticky-ends in each subset of cDNAs can be determined from the positions on the grid from which fluorescence is observed. Analysis using hybridisation grids will also provide quantitative information in the same way as the ‘Adaptor Cycle’. Such methods are described in Lehrach and Poustka, Trends Genet. 2, 174-179, 1986; and Pevzner et al, Journal of Biomolecular Structure and Dynamics 9, 399-410, 1991.

As further information is acquired it will be possible to develop the process further, for example to make use of database information.

Clearly with use of this process a significant database of signatures and their corresponding genes will be acquired. It is estimated that there may be as many as 10,000 housekeeping genes. For most purposes it is the tissue specific cDNAs that researchers will be interested in. The presence of the housekeeping genes will be expected, and it will be extremely wasteful to have to identify these every time the process is used, except perhaps for calibrating expression levels. It will be possible, using the adaptor cycle, to ignore certain subsets of cDNAs or miss out certain adaptors if the genes they identify are known housekeeping genes. This should greatly speed up the process of profiling a cell's cDNAs. Moreover it is highly likely that most adaptors will not hybridise to any sequences. If the tissue specific genes are already known, and information about abundance is all that is sought, then only the adaptors corresponding to the expected signatures need be used.

These sorts of process modifications may require liquid-handling robotics that are flexible in their programming.

As a further modification, the choice of restriction endonucleases can be optimised. Since the spatial correlation of bases and nucleotide frequencies is not random in the genomes of living organisms, it might be found empirically that certain combinations of sampling enzymes resolve more sequences using the 8 bp signatures than other combinations. Clearly these would be of great value, as they would save time spent on resolving signatures that return multiple sequences.

Similarly, once a database of cell-type specific genes is established, resolution steps will probably not be required, as it will be known which genes, and hence which signatures, are to be expected in a given cell type.

Analysing cDNAs to determine sequence variation of alleles of a particular gene is a further application that would be of great value to develop, in the context of analysing how these changes might alter patterns of gene expression in a cell. Variations in alleles may alter signatures, and again these sorts of effects will only become apparent with use of this invention and will in the long term form another extremely useful database for improving the use of this invention.

EXAMPLE

Experimental Design

Three different PCR products were used to represent 3 different genes at varying expression levels. The PCR products used were exons 14, 16 and 19 of the anion exchanger (AE1), as these PCRs had already been optimised in our laboratory. These will be referred to as AE14, AE16 and AE19.

The products were captured to Dynal beads (by incorporating a biotin in one of the PCR primers) and effectively represent captured cDNA. AE16 was at half the concentration of AE14 and AE19 was at one fifth the concentration of AE14.

AE14 Sequence

ccaaagctgggagagaacagaatgccttggttttctgctgcagatcttccaggaccacccactacagaagac

ttataactacaacgtgttgatggtgcccaaacctcagggccccctgcccaacacagccctcctctcccttgt

gctcatggccggtaccttcttctttgccatgatgcgcaagttcaagaacagctcctatttccctggcaa

gtcagcataccctcctcgcctgtccttgccaacactgc

AE16 Sequence

ctgggagaatgccagggaaaggtctctgcctcccaccctcccaggcccagcccccaccctgtctctcacgtg

gtgatctgagactccaggaatatgaggatgaagaccagcagagcaggcagggcggaggcaaaatcataaaga

tgggaaactcggaacgcaagcccagtgggtggatgacccagccccgggctgaggagttgacaccttgaagcc

atcaggcaccgagagtttctgtgggagggggtagcaggtaagaatgccaagggc

AE19 Sequence

gtgataggcactgaccccagcctccgcctgcaggtgaagacctggcgcatgcacttattcacgggcatccag

atcatctgcctggcagtgctgtgggtggtgaagtccacgccggcctccctggccctgcccttcgtcctcatc

ctcactgtgccgctgcggcgcgtcctgctgccgctcatcttcaggaacgtggagcttcagtgtgtgagtggc

tgcctgggcctggggcacaagagctgggagcatgcg

Following capture, they were first digested with the frequent cutter Sau3A1. This enzyme recognises the sequence GATC.

This provided the following 4 bp overhangs of each of the products.

AE14
    TTCCAGGACCACC...
CTAGAAGGTCCTGGTGG...
AE16
    TGAGACTCCAGGAATAT...
CTAGACTCTGAGGTCCTTATA...
AE19
    ATCTGCCTGGCAG...
CTAGTAGACGGACCGTC...

The following adaptor, complementary to the 4 bp overhang revealed by Sau3A1 and containing a Fok I site, was ligated to the captured fragments.

Adaptor SauFAM

FAM-CTAGAGGACGATCGA.GGATG.
    GATCTCCTGCTAGCT.CCTAC.GATC
                      |
                  Fok I site

AE14
FAM-CTAGAGGACGATCGA.GGATG.GATC.TTCCAGGACCACC...
    GATCTCCTGCTAGCT.CCTAC.CTAG.AAGGTCCTGGTGG...
AE16
FAM-CTAGAGGACGATCGA.GGATG.GATC.TGAGACTCCAGGAATAT...
    GATCTCCTGCTAGCT.CCTAC.CTAG.ACTCTGAGGTCCTTATA...
AE19
FAM-CTAGAGGACGATCGA.GGATG.GATC.ATCTGCCTGGCAG...
    GATCTCCTGCTAGCT.CCTAC.CTAG.TAGACGGACCGTC...

These sequences were then digested with Fok I, which cuts at 9 and 13 bases from GGATG, and the following fragments were released into solution.

AE14
FAM-CTAGAGGACGATCGA.GGATG.GATC.TTCCA
    GATCTCCTGCTAGCT.CCTAC.CTAG.AAGGTCCTG
AE16
FAM-CTAGAGGACGATCGA.GGATG.GATC.TGAGA
    GATCTCCTGCTAGCT.CCTAC.CTAG.ACTCTGAGG
AE19
FAM-CTAGAGGACGATCGA.GGATG.GATC.ATCTG
    GATCTCCTGCTAGCT.CCTAC.CTAG.TAGACGGAC
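The Fok I digestion above can be reproduced with a few lines of string handling. The sketch below is an illustration only: it takes the top strand of the ligated AE14 construct as written above, applies the stated cut distances of 9 bases (top strand) and 13 bases (bottom strand) from GGATG, and returns the released strands in the same orientation as the diagrams (bottom strand written left to right as the base-by-base complement of the top). The function name is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def fok1_release(top, site="GGATG", top_cut=9, bottom_cut=13):
    """Return (released top strand, released bottom strand, overhang region).

    The bottom strand is given as the complement of the top strand, aligned
    underneath it as in the diagrams. The overhang region is the stretch of
    top-strand sequence left single-stranded on the bottom strand.
    """
    end_of_site = top.index(site) + len(site)
    released_top = top[:end_of_site + top_cut]
    released_bottom = top[:end_of_site + bottom_cut].translate(COMPLEMENT)
    overhang_region = top[end_of_site + top_cut:end_of_site + bottom_cut]
    return released_top, released_bottom, overhang_region


ae14_top = "CTAGAGGACGATCGA" + "GGATG" + "GATC" + "TTCCAGGACCACC"
released_top, released_bottom, overhang_region = fok1_release(ae14_top)
print(released_top)
print(released_bottom)
print(overhang_region)
```

For AE14 this gives the two released strands shown above and identifies GGAC as the four top-strand positions that remain single-stranded on the bottom strand.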

The cleaved fragments were then captured, through ligation, to 3 different wells of a microtitre plate, each containing a specific adaptor (which contains a site for Bbv I, ‘GCAGC’), simulating the first stage division into 256 subgroups and providing the first 4 bases. Bbv I cuts at 8 and 12 bases from GCAGC.

The full adaptor sequences are shown below under Materials and Methods.

For AE14 (adaptor Bbv14)
Biotin-N-GCAGC.AGA
       N-CGTCG.TCT.CAGG
           |
       Bbv I site
For AE16 (adaptor Bbv16)
Biotin-N-GCAGC.AGA
       N-CGTCG.TCT.CCTC
For AE19 (adaptor Bbv19)
Biotin-N-GCAGC.AGA
       N-CGTCG.TCT.GTCC

Where N is a number of bases

This produced the following sequences:

For AE14
Biotin-N-GCAGC.AGA.GTCCTGGAAGATC.CATCC.AGCTAGCAGGAGATC
       N-CGTCG.TCT.CAGGACCTTCTAG.GTAGG.TCGATCGTCCTCTAG-FAM
For AE16
Biotin-N-GCAGC.AGA.GGAGTCTCAGATC.CATCC.AGCTAGCAGGAGATC
       N-CGTCG.TCT.CCTCAGAGTCTAG.GTAGG.TCGATCGTCCTCTAG-FAM
For AE19
Biotin-N-GCAGC.AGA.CAGGCAGATGATC.CATCC.AGCTAGCAGGAGATC
       N-CGTCG.TCT.GTCCGTCTACTAG.GTAGG.TCGATCGTCCTCTAG-FAM

At this point the concentration was measured through fluorescence of the FAM label and the first 4 bases (XXXX) determined.

Following this the fragments were digested with Bbv I and the next 4 bp revealed.

For AE14
Biotin-N-GCAGC.AGA.GTCCT
       N-CGTCG.TCT.CAGGACCTT
For AE16
Biotin-N-GCAGC.AGA.GGAGT
       N-CGTCG.TCT.CCTCAGAGT
For AE19
Biotin-N-GCAGC.AGA.CAGGC
       N-CGTCG.TCT.GTCCGTCTA

Following digestion, 3 different adaptors, complementary to the 3 different 4 bp overhangs, were ligated to each well in turn to simulate the ‘adaptor cycle’, and the fluorescence was measured at each stage.

These adaptors were

AE14 (adaptor C14)
GGAA.GATCCTGGACAGTTG
     CTAGGACCTGTCAAC-FAM
AE16 (adaptor C16)
CTCA.GATCCTGGACAGTTG
     CTAGGACCTGTCAAC-FAM
AE19 (adaptor C19)
AGAT.GATCCTGGACAGTTG
     CTAGGACCTGTCAAC-FAM

Successful ligation, measured by fluorescence, therefore provided concentration information and the next 4 bases (YYYY) of the ‘tag’.

Tag—GATC.YYYY.N.XXXX

Where GATC corresponds to the Sau3A1 site, XXXX to the first 4 bases uncovered by the Fok I digestion, and YYYY to the next 4 bases revealed by Bbv I; XXXX and YYYY are separated by a single unknown base, N.
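The tag for each product can be assembled from the sticky-ends of the two adaptors that ligated successfully. The sketch below assumes a particular reading of the diagrams above: the well ('Bbv') adaptor overhang is taken as written left to right in the lower strand (3' to 5'), and the cycling adaptor overhang as written left to right in the upper strand (5' to 3'). Under that reading, XXXX is the well adaptor overhang reversed and YYYY is the reverse complement of the cycling adaptor overhang. The function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def build_tag(well_overhang, cycling_overhang, reference="GATC"):
    """Assemble the tag GATC.YYYY.N.XXXX from the two adaptor overhangs.

    well_overhang    -- sticky end of the 'Bbv' adaptor as drawn (e.g. CAGG)
    cycling_overhang -- sticky end of the cycling adaptor as drawn (e.g. GGAA)
    """
    xxxx = well_overhang[::-1]
    yyyy = reverse_complement(cycling_overhang)
    return f"{reference}.{yyyy}.N.{xxxx}"


print(build_tag("CAGG", "GGAA"))  # AE14: adaptors Bbv14 and C14
print(build_tag("CCTC", "CTCA"))  # AE16: adaptors Bbv16 and C16
print(build_tag("GTCC", "AGAT"))  # AE19: adaptors Bbv19 and C19
```

The resulting tags agree with the sequences that follow the Sau3A1 site in the three products, for example TTCC, A, GGAC in AE14.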

Materials and Methods

Adaptor Sequences and Preparation

SauFam

5′-FAM-CTAGAGGACGATCGAGGATG-3′
    3′-GATCTCCTGCTAGCTCCTACCTAG-PO4-5′

‘Bbv’ Adaptors

Bbv14
5′BIOTIN-6C-CCTAGACTAGAGGACCGATCGAATCAGCAGCAGA-3′
          3′-GATCTGATCTCCTGGCTAGCTTAGTCGTCGTCTCAGG-PO4-5′
Bbv16
5′BIOTIN-6C-CCTAGACTAGAGGACCGATCGAATCAGCAGCAGA-3′
          3′-GATCTGATCTCCTGGCTAGCTTAGTCGTCGTCTCCTC-PO4-5′
Bbv19
5′BIOTIN-6C-CCTAGACTAGAGGACCGATCGAATCAGCAGCAGA-3′
          3′-GATCTGATCTCCTGGCTAGCTTAGTCGTCGTCTGTCC-PO4-5′

Cycling Adaptors

C14
5′FAM-CAACTGTCCAGGATC-3′
   3′-GTTGACAGGTCCTAGAAGG-PO4-5′
C16
5′FAM-CAACTGTCCAGGATC-3′
   3′-GTTGACAGGTCCTAGACTC-PO4-5′
C19
5′FAM-CAACTGTCCAGGATC-3′
   3′-GTTGACAGGTCCTAGTAGA-PO4-5′

BioFAMFok

5′BIOTIN-GGTCACTTAGATCGATCCATGAGGATGCTTCATTCTGATTCAGTCC-3′
      3′-CCAGTGAATCTAGCTAGGTACTCCTACGAAGTAAGACTAAGTCAGG-FAM

BioG

5′BIOTIN-GCATCTGGAGTCTACAGTCGTCTATTGACG-3′
      3′-CGTAGACCTCAGATGTCAGCAGATAACTGCCGGC-PO4-5′

GCCG

5′FAM-GCATCAGGATGTACAG-3′
   3′-CGTAGTCCTACATGTCGCCA-PO4-5′

FAM - fluorescein
PO4 - phosphate

All primers were purchased from Oswell DNA Services.

All adaptors were made by heating 200 ul of TE containing each primer at 20 pmol/ul concentration at 90° C. in a Techne Dryblock and allowing the block to cool to room temperature over 2 hours. The adaptors were then incubated on ice for 1 hour and then frozen at −20° C. until used.

Binding Bbv14,16, and 19 Adaptors to Microtitre Plate

In order to capture the Fok I cleaved fragments to the ‘Bbv’ adaptors via ligation, the ‘Bbv’ adaptors were bound to black, streptavidin coated 96 well microtitre plates (Boehringer Mannheim). This was achieved by incubating 10 pmol of the appropriate adaptor in 35 ul of 1×TE+0.1M NaCl in each well overnight at 4° C. Following the overnight incubation each well was washed 3 times with 50 ul of 1×TE+0.1M NaCl. The 1×TE+0.1M NaCl was removed, 50 ul of 1×ligase buffer was added to each well and the plate was stored at 4° C. until used.

Plate Capacity

To determine the binding capacity of each well, BioFAMFok adaptor was bound to 8 wells by incubating 10 pmol of the adaptor in 25 ul of 1×TE+0.1M NaCl in each well overnight at 4° C. Following the overnight incubation each well was washed 3 times with 50 ul of 1×TE+0.1M NaCl. A dilution series of BioFAMFok (5, 2.5, 1.25, 0.625, 0.3375 pmol) in 1×TE+0.1M NaCl was added to a further series of wells and the fluorescence of the plate was read in a Biolumin Microtiter Plate Reader (Molecular Dynamics).

The following readings (expressed as Relative Fluorescent Units) were obtained.

Dilution Wells

5 pmol        74575 RFU
2.5 pmol      35429 RFU
1.25 pmol     16232 RFU
0.625 pmol     9388 RFU
0.3375 pmol    4807 RFU

Wells incubated with 10 pmol of adaptor and washed

20872 RFU

21516 RFU

22519 RFU

21679 RFU

22658 RFU

21517 RFU

21742 RFU

22417 RFU

mean 21865

From these figures one can calculate that 21865 RFUs is equal to approximately 1.5 pmol of BioFAMFok. These data agree with the capacity of the wells to bind biotinylated double stranded DNA (5 pmol hybridised in 200 ul) provided by the Boehringer Mannheim technical help line.
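The estimate of bound adaptor can be reproduced from the calibration readings. The sketch below is one possible calculation, assuming a linear RFU response fitted through the origin by least squares; the text does not state which method was actually used, and all names are illustrative.

```python
from statistics import mean

# Dilution series as tabulated above: (pmol of BioFAMFok, RFU).
CALIBRATION = [
    (5, 74575),
    (2.5, 35429),
    (1.25, 16232),
    (0.625, 9388),
    (0.3375, 4807),
]

# Readings from the eight wells incubated with 10 pmol of adaptor and washed.
WASHED_WELLS = [20872, 21516, 22519, 21679, 22658, 21517, 21742, 22417]


def slope_through_origin(points):
    """Least-squares slope of y = k * x (no intercept term)."""
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    return sum_xy / sum_xx


RFU_PER_PMOL = slope_through_origin(CALIBRATION)  # assumed linear response
MEAN_RFU = mean(WASHED_WELLS)
BOUND_PMOL = MEAN_RFU / RFU_PER_PMOL

print(f"mean reading: {MEAN_RFU} RFU")
print(f"estimated bound adaptor: {BOUND_PMOL:.2f} pmol per well")
```

Under this assumption the estimate comes out close to 1.5 pmol per well, in line with the figure quoted in the text.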

Effect of Tween20 on Ligation

The addition of 0.1% Tween 20 to the reaction buffer used with Fok I is claimed to reduce the exonuclease activity associated with this enzyme (Fok I data sheet, New England Biolabs). The following experiment was performed in order to determine whether the addition of Tween 20 would have any effect on the subsequent ligation of the cleaved fragments.

Nine reactions were set up as three sets of three, the sets containing 0, 0.05 or 0.1% Tween 20 respectively, each reaction in 25 ul of 1×ligase buffer with 10 pmol BioG adaptor, 10 pmol GCCG adaptor and 200 ul ligase (New England Biolabs). A further set of three reactions was set up as above with 0.1% Tween 20 and no ligase. These were then incubated at 16° C. for 1 hour and each reaction was then transferred to a well of a black streptavidin coated microtitre plate (Boehringer Mannheim). The plate was incubated at room temperature for one hour, each well was washed 3 times with 100 ul of TES and the fluorescence was measured in a Biolumin Microtiter Plate Reader (Molecular Dynamics).

The following readings (expressed as Relative Fluorescent Units) were obtained.

         0% Tween 20   0.05% Tween 20   0.1% Tween 20   0.1% Tween 20 (no ligase)
         8592          8742             10213           3660
         8083          8712             10605           3967
         8720          8519             11598           3468
means    8465          8657             10805           3698

The above data demonstrate that the inclusion of 0.1% Tween 20 increases ligation efficiency and therefore should not be detrimental to the ligation of the Fok I cleaved fragments to the ‘Bbv’ adaptors.
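The column means can be recomputed as a quick check. The sketch below assumes the four columns of readings correspond, in order, to 0%, 0.05%, 0.1% and 0.1% (no ligase) Tween 20; that ordering is an interpretation of the tabulated data, and the names used are illustrative only.

```python
# Triplicate readings in RFU. The assignment of columns to conditions is an
# assumption made for this sketch.
READINGS = {
    "0% Tween 20": [8592, 8083, 8720],
    "0.05% Tween 20": [8742, 8712, 8519],
    "0.1% Tween 20": [10213, 10605, 11598],
    "0.1% Tween 20 (no ligase)": [3660, 3967, 3468],
}

# The tabulated means appear to be truncated to whole RFUs, so integer
# division is used here to reproduce them.
MEANS = {label: sum(values) // len(values) for label, values in READINGS.items()}

# Signal with 0.1% Tween 20 relative to the reaction without Tween 20.
RELATIVE_SIGNAL = MEANS["0.1% Tween 20"] / MEANS["0% Tween 20"]

for label, value in MEANS.items():
    print(f"{label}: {value} RFU")
print(f"0.1% vs 0% Tween 20: {RELATIVE_SIGNAL:.2f}x")
```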

PCR Primers and Conditions and Purification

The 3 PCR products used to represent cDNA transcripts at different concentrations were exons 14, 16 and 19 from the human erythrocyte anion exchanger gene located on chromosome 17q21-22.

Primer sequences used to amplify exons 14, 16 and 19

Exon 14

Forward primer

5′-GTATTTTCCAGCCCAAGCCAAAGCTGG-3′

Reverse primer

5′BIOTIN-GCAGTGTTGGCAAGGACAGGC-3′

Exon 16

Forward primer

5′BIOTIN-GCCCTTGGCATTCTTACCTGC-3′

Reverse primer

5′-CTGGGAGAATGCCAGGGAAAGG-3′

Exon 19

Forward primer

5′-GTGATAGGCACTGACCCCAG-3′

Reverse primer

5′BIOTIN-CGCATGCTCCCAGCTCTTGTGC-3′

The inclusion of biotin in one of the primers in each set allows the PCR products to be captured on streptavidin coated beads (Dynal UK).

All PCR reactions were performed in 50 ul containing 1×Amplitaq buffer (Perkin Elmer), 30 pmol of forward and reverse primer, 200 uM dNTPs, 1.25 units of Amplitaq (Perkin Elmer) and 100 ng of human genomic DNA. The reactions were overlaid with 50 ul of mineral oil and cycled on a Techne ‘Genie’ PCR machine with the following conditions.

Exon 14

1 cycle 95° C. for 2 min

35 cycles 57.5° C. for 45 sec, 72° C. for 1 min, 95° C. for 35 sec

1 cycle 72° C. for 5 min

Exon 16

1 cycle 95° C. for 2 min

35 cycles 52° C. for 45 sec, 72° C. for 1 min, 95° C. for 35 sec

1 cycle 72° C. for 5 min

Exon 19

1 cycle 95° C. for 2 min

35 cycles 57.5° C. for 45 sec, 72° C. for 1 min, 95° C. for 35 sec

1 cycle 72° C. for 5 min

Purification

Excess primers and salts need to be removed before the PCR products are bound to DynaBeads; this was performed as described below.

Ten reactions for each exon were pooled separately following PCR, prior to purification. The PCR products were then ethanol precipitated by adding 2.5 volumes of 100% ethanol and one tenth of a volume of 3M sodium acetate. The solution was then incubated at −20° C. for 30 minutes and spun at 13000 rpm in a Heraeus A13 benchtop centrifuge for 15 minutes to pellet the DNA. The supernatant was then poured off and the pellet allowed to air dry. The dry pellet was then resuspended in 150 ul of water. Following this, 2 Chromospin-100 columns (Clontech) were prepared for each sample by spinning the columns in a Heraeus 17RS centrifuge for 3 minutes at 3500 rpm according to the manufacturer's instructions. Following centrifugation, 75 ul of the DNA solution was added to each prepared column and spun as before, collecting the purified DNA into a 1.5 ml eppendorf tube. The 2 samples for each exon were then pooled and the DNA concentration measured by reading the absorption at 260 nm and 280 nm in a Pharmacia Genequant spectrophotometer.

Solutions and Buffers

1×TE pH7.6

10 mM Tris HCl

1 mM EDTA

TES pH7.5

10 mM Tris-HCl

1 mM EDTA

2M NaCl

1×Fok I buffer pH7.9

50 mM potassium acetate

20 mM Tris Acetate

10 mM magnesium acetate

1 mM DTT

1×Bbv I buffer pH7.9

50 mM NaCl

10 mM Tris-HCl

10 mM MgCl2

1 mM DTT

1×Sau 3A buffer pH7.9

33 mM Tris acetate

66 mM potassium acetate

10 mM magnesium acetate

0.5 mM DTT

1×Ligase buffer pH7.8

50 mM Tris-HCl

10 mM MgCl2

10 mM DTT

1 mM ATP

50 ug/ml BSA

Results

Concentrations of Column Purified DNA

exon 14—130 ng/ul

exon 16—120 ng/ul

exon 19—115 ng/ul

1 ug exon14 (255 bp)=5.9 pmol, 1 ug exon16 (272 bp)=5.58 pmol, 1 ug exon19 (252 bp)=6.03 pmol

1 ug exon14=7.7 ul, 1 ug exon16=8.3 ul, 1 ug exon19=8.7 ul; therefore exon 14=0.76 pmol/ul, exon 16=0.67 pmol/ul, exon 19=0.69 pmol/ul
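The conversions above can be reproduced with a short calculation. The sketch below assumes an average mass of 660 g/mol per base pair of double stranded DNA, a commonly used approximation that the text does not state explicitly; the results agree with the quoted figures to within rounding. Function and variable names are illustrative.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of dsDNA (assumed approximation)


def pmol_per_ug(length_bp: int) -> float:
    """pmol of double stranded DNA contained in 1 ug of a fragment."""
    return 1e6 / (length_bp * AVG_BP_MASS)


def pmol_per_ul(ng_per_ul: float, length_bp: int) -> float:
    """Molar concentration in pmol/ul from a mass concentration in ng/ul."""
    return (ng_per_ul / 1000.0) * pmol_per_ug(length_bp)


# (measured concentration in ng/ul, fragment length in bp) as given above.
EXONS = {
    "exon 14": (130, 255),
    "exon 16": (120, 272),
    "exon 19": (115, 252),
}

CONCENTRATIONS = {
    name: pmol_per_ul(ng, bp) for name, (ng, bp) in EXONS.items()
}

for name, (ng, bp) in EXONS.items():
    print(
        f"{name}: {pmol_per_ug(bp):.2f} pmol/ug, "
        f"{1000 / ng:.1f} ul/ug, {CONCENTRATIONS[name]:.2f} pmol/ul"
    )
```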

Sau 3A1 Digest

30, 15 and 6 pmol of column purified exons 14, 16 and 19, respectively, were digested with 20 units of Sau 3A1 in 100 ul of 1×Sau 3A1 buffer at 37° C. for 4 hours.

exon14                  39.5 ul
exon16                  22.4 ul
exon19                   8.7 ul
Sau 3A1                  5 ul
10×Sau 3A1 buffer       10 ul
H2O                     14.4 ul
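The volumes in the digest follow from the stock concentrations and the target amounts of each exon. A minimal sketch of that arithmetic, using the rounded pmol/ul values quoted earlier and hypothetical variable names:

```python
# Stock concentrations (pmol/ul) and target amounts (pmol) from the text.
STOCK_PMOL_PER_UL = {"exon14": 0.76, "exon16": 0.67, "exon19": 0.69}
TARGET_PMOL = {"exon14": 30, "exon16": 15, "exon19": 6}

# Fixed components of the 100 ul digest, in ul.
FIXED_UL = {"Sau 3A1": 5.0, "10x Sau 3A1 buffer": 10.0}
TOTAL_UL = 100.0

# Volume of each exon stock needed to deliver the target amount.
DNA_UL = {
    name: round(TARGET_PMOL[name] / STOCK_PMOL_PER_UL[name], 1)
    for name in TARGET_PMOL
}

# Water makes up the remainder of the reaction volume.
WATER_UL = round(TOTAL_UL - sum(DNA_UL.values()) - sum(FIXED_UL.values()), 1)

for name, volume in DNA_UL.items():
    print(f"{name}: {volume} ul")
print(f"H2O: {WATER_UL} ul")
```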

Following digestion the reaction mix was heated at 65° C. in a Techne Dryblock for 20 minutes to inactivate the enzyme.

Preparation of DynaBead M280

According to the manufacturer's instructions 3 mg of DynaBeads M280 will bind 60-120 pmol of biotinylated double stranded DNA.

300 ul of DynaBeads M280 at 1 mg/ml were washed with 100 ul TES by holding the beads to the side of an eppendorf tube with a Magnetic Particle Concentrator (Dynal UK) so that the supernatant could be removed. This was repeated three times (all subsequent bead manipulations were carried out in this manner according to the manufacturer's instructions). The beads were resuspended in 100 ul of TES, and the Sau 3A1 digested DNA was added and incubated at room temperature for 1 hour to allow the biotinylated DNA to bind to the beads.

The beads/DNA were then washed three times with 1×ligase buffer using the Magnetic Particle Concentrator (Dynal UK) as before.

Ligation of SauFAM Adaptor (Containing Fok I Site)

The supernatant was removed and the beads/DNA were resuspended in 75 ul of 1×ligase buffer containing 300 pmol of SauFAM adaptor and 4000 units of ligase (New England Biolabs).

Beads/DNA, 7.5 ul 10×ligase buffer, 15 ul SauFAM (at 20 pmol/ul), 10 ul ligase (at 400 units/ul), 42.5 ul H2O.

The reaction was then incubated at 16° C. for 2 hours.

Fok I Digestion

Following ligation the beads/DNA were washed 2 times with 75 ul of 1×Fok I buffer, then resuspended in 10 ul of 1×Fok I buffer and heated at 65° C. in a Techne Dryblock for 20 minutes to inactivate any remaining ligase. The buffer was removed and the beads/DNA resuspended in 95 ul of 1×Fok I buffer containing 20 units of Fok I (New England Biolabs).

Beads/DNA, 9.5 ul 10×Fok I buffer, 5 ul Fok I (at 4 units/ul)

The beads/DNA were then incubated at 37° C. for 2 hours.

Following incubation the supernatant, containing the fragments cleaved by Fok I, was transferred to a fresh eppendorf tube and heated at 65° C. for 20 minutes in a Techne Dryblock to inactivate the Fok I.

Ligation of Fok I Cleaved Fragments to Bbv Adaptors on Microtiter Plate

The Fok I fragments were then divided into three tubes, each containing 30 ul of Fok I cleaved fragments, 5 ul of 10×ligase buffer, 3 ul ligase (at 400 units/ul, New England Biolabs) and 12 ul of H2O.

The ligase buffer on a plate containing adaptors Bbv14, 16 and 19 in separate wells (prepared as previously described) was removed and the above reaction mixtures, containing the Fok I cleaved fragments and ligase, were added, one to each well.

The wells were then incubated at 16° C. for one hour and then washed three times with 50 ul of TES. The TES was removed from the wells, another 50 ul of TES added and the fluorescence measured in a Biolumin Microplate Reader (Molecular Dynamics). A well to which no fragments were added, and which contained only Bbv adaptors, was used as a blank.

Data Expressed as RFUs

Bbv14 well   1774 RFU
Bbv16 well   1441 RFU
Bbv19 well   1192 RFU
Blank        1010 RFU

The reading from the blank well, which is a background reading, was subtracted from the readings of the other wells and gave the following.

Bbv14 well   764 RFU
Bbv16 well   431 RFU
Bbv19 well   182 RFU

As half as much exon 16 as exon 14 (15 pmol exon 16, 30 pmol exon 14) was included in the procedure, the reading obtained from the Bbv16 well should be half (i.e. 50%) of that obtained from the Bbv14 well; and as one fifth the amount of exon 19 compared to exon 14 (6 pmol exon 19, 30 pmol exon 14) was included, the reading obtained from the Bbv19 well should be one fifth (i.e. 20%) of that obtained from the Bbv14 well.

Ideal Reading Expressed As Percentages

Bbv14 well   100
Bbv16 well    50
Bbv19 well    20

Actual Readings Expressed As Percentages (using Bbv14 well as 100%)

Bbv14 well   100
Bbv16 well   56.4
Bbv19 well   23.8

Bbv16 well   6.4% error
Bbv19 well   3.8% error
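The background subtraction and normalisation above can be sketched as follows. The readings and input amounts are taken from the text; the variable names are illustrative, and the error is expressed, as in the text, as the difference in percentage points between the observed and ideal readings.

```python
# Raw fluorescence readings (RFU) and the blank (adaptor-only) well.
RAW_RFU = {"Bbv14": 1774, "Bbv16": 1441, "Bbv19": 1192}
BLANK_RFU = 1010

# Amount of each exon entered into the procedure (pmol).
INPUT_PMOL = {"Bbv14": 30, "Bbv16": 15, "Bbv19": 6}
REFERENCE = "Bbv14"  # well taken as 100%

# Subtract the background reading from every well.
NET_RFU = {well: rfu - BLANK_RFU for well, rfu in RAW_RFU.items()}

# Observed and ideal signals as a percentage of the reference well.
OBSERVED_PCT = {
    well: round(100 * net / NET_RFU[REFERENCE], 1) for well, net in NET_RFU.items()
}
IDEAL_PCT = {
    well: 100 * pmol / INPUT_PMOL[REFERENCE] for well, pmol in INPUT_PMOL.items()
}

# Deviation from the ideal reading, in percentage points.
ERROR_PCT = {
    well: round(OBSERVED_PCT[well] - IDEAL_PCT[well], 1) for well in RAW_RFU
}

for well in RAW_RFU:
    print(
        f"{well}: net {NET_RFU[well]} RFU, observed {OBSERVED_PCT[well]}%, "
        f"ideal {IDEAL_PCT[well]:.0f}%, error {ERROR_PCT[well]}"
    )
```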

Therefore, this process is capable of separating a mixed population of DNA and identifying 4 bp, while at the same time maintaining the relative proportions of the original mixture with minimal error. The sorted fragments can in turn be reprobed to obtain another 4 bp and the associated quantitative data.

Number of sequences: 41

SEQ ID NO: 1
254 base pairs; nucleic acid; unknown; unknown; DNA (genomic); unknown
CCAAAGCTGG GAGAGAACAG AATGCCTTGG TTTTCTGCTG CAGATCTTCC AGGACCACCC 60
ACTACAGAAG ACTTATAACT ACAACGTGTT GATGGTGCCC AAACCTCAGG GCCCCCTGCC 120
CAACACAGCC CTCCTCTCCC TTGTGCTCAT GGCCGGTACC TTCTTCTTTG CCATGATGCT 180
GCGCAAGTTC AAGAACAGCT CCTATTTCCC TGGCAAGTCA GCATACCCTC CTCGCCTGTC 240
CTTGCCAACA CTGC 254

SEQ ID NO: 2
270 base pairs; nucleic acid; unknown; unknown; DNA (genomic); unknown
CTGGGAGAAT GCCAGGGAAA GGTCTCTGCC TCCCACCCTC CCAGGCCCAG CCCCCACCCT 60
GTCTCTCACG TGGTGATCTG AGACTCCAGG AATATGAGGA TGAAGACCAG CAGAGCAGGC 120
AGGGCGGAGG CAAAATCATC CAGATGGGAA ACTCGGAACG CAAGCCCAGT GGGTGGATGA 180
CCCAGCCCCG GGCTGAGGAG TTGACACCTT GAAGCCATCA GGCACCGAGA GTTTCTGTGG 240
GAGGGGGTAG CAGGTAAGAA TGCCAAGGGC 270

SEQ ID NO: 3
253 base pairs; nucleic acid; unknown; unknown; DNA (genomic); unknown
GTGATAGGCA CTGACCCCAG CCTCCGCCTG CAGGTGAAGA CCTGGCGCAT GCACTTATTC 60
ACGGGCATCC AGATCATCTG CCTGGCAGTG CTGTGGGTGG TGAAGTCCAC GCCGGCCTCC 120
CTGGCCCTGC CCTTCGTCCT CATCCTCACT GTGCCGCTGC GGCGCGTCCT GCTGCCGCTC 180
ATCTTCAGGA ACGTGGAGCT TCAGTGTGTT GAGTGGCTGC CTGGGCCTGG GGCACAAGAG 240
CTGGGAGCAT GCG 253

SEQ ID NO: 4
30 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
TTCCAGGACC ACCCTAGAAG GTCCTGGTGG 30

SEQ ID NO: 5
38 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
TGAGACTCCA GGAATATCTA GACTCTGAGG TCCTTATA 38

SEQ ID NO: 6
30 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
ATCTGCCTGG CAGCTAGTAG ACGGACCGTC 30

SEQ ID NO: 7
44 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
CTAGAGGACG ATCGAGGATG GATCTCCTGC TAGCTCCTAC GATC 44

SEQ ID NO: 8
74 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
CTAGAGGACG ATCGAGGATG GATCTTCCAG GACCACCGAT CTCCTGCTAG CTCCTACCTA 60
GAAGGTCCTG GTGG 74

SEQ ID NO: 9
82 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
CTAGAGGACG ATCGAGGATG GATCTGAGAC TCCAGGAATA TGATCTCCTG CTAGCTCCTA 60
CCTAGACTCT GAGGTCCTTA TA 82

SEQ ID NO: 10
74 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
CTAGAGGACG ATCGAGGATG GATCATCTGC CTGGCAGGAT CTCCTGCTAG CTCCTACCTA 60
GTAGACGGAC CGTC 74

SEQ ID NO: 11
62 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
CTAGAGGACG ATCGAGGATG GATCTTCCAG ATCTCCTGCT AGCTCCTACC TAGAAGGTCC 60
TG 62

SEQ ID NO: 12
62 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
CTAGAGGACG ATCGAGGATG GATCTGAGAG ATCTCCTGCT AGCTCCTACC TAGACTCTGA 60
GG 62

SEQ ID NO: 13
62 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
CTAGAGGACG ATCGAGGATG GATCATCTGG ATCTCCTGCT AGCTCCTACC TAGTAGACGG 60
AC 62

SEQ ID NO: 14
20 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
GCAGCAGACG TCGTCTCAGG 20

SEQ ID NO: 15
20 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
GCAGCAGACG TCGTCTCCTC 20

SEQ ID NO: 16
20 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
GCAGCAGACG TCGTCTGTCC 20

SEQ ID NO: 17
82 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
GCAGCAGAGT CCTGGAAGAT CCATCCAGCT AGCAGGAGAT CCGTCGTCTC AGGACCTTCT 60
AGGTAGGTCG ATCGTCCTCT AG 82

SEQ ID NO: 18
82 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
GCAGCAGAGG AGTCTCAGAT CCATCCAGCT AGCAGGAGAT CCGTCGTCTC CTCAGAGTCT 60
AGGTAGGTCG ATCGTCCTCT AG 82

SEQ ID NO: 19
82 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
GCAGCAGACA GGCAGATGAT CCATCCAGCT AGCAGGAGAT CCGTCGTCTG TCCGTCTACT 60
AGGTAGGTCG ATCGTCCTCT AG 82

SEQ ID NO: 20
30 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
GCAGCAGAGT CCTCGTCGTC TCAGGACCTT 30

SEQ ID NO: 21
30 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
GCAGCAGAGG AGTCGTCGTC TCCTCAGAGT 30

SEQ ID NO: 22
30 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
GCAGCAGACA GGCCGTCGTC TGTCCGTCTA 30

SEQ ID NO: 23
34 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
GGAAGATCCT GGACAGTTGC TAGGACCTGT CAAC 34

SEQ ID NO: 24
34 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
CTCAGATCCT GGACAGTTGC TAGGACCTGT CAAC 34

SEQ ID NO: 25
34 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
AGATGATCCT GGACAGTTGC TAGGACCTGT CAAC 34

SEQ ID NO: 26
44 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
CTAGAGGACG ATCGAGGATG GATCTCCTGC TAGCTCCTAC CTAG 44

SEQ ID NO: 27
72 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
CCCTAGACTA GAGGACCGAT CGAATCAGCA GCAGAGATCT GATCTCCTGG CTAGCTTAGT 60
CGTCGTCTCA GG 72

SEQ ID NO: 28
72 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
CCCTAGACTA GAGGACCGAT CGAATCAGCA GCAGAGATCT GATCTCCTGG CTAGCTTAGT 60
CGTCGTCTCC TC 72

SEQ ID NO: 29
72 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
CCCTAGACTA GAGGACCGAT CGAATCAGCA GCAGAGATCT GATCTCCTGG CTAGCTTAGT 60
CGTCGTCTGT CC 72

SEQ ID NO: 30
34 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
CAACTGTCCA GGATCGTTGA CAGGTCCTAG AAGG 34

SEQ ID NO: 31
34 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
CAACTGTCCA GGATCGTTGA CAGGTCCTAG ACTC 34

SEQ ID NO: 32
34 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
CAACTGTCCA GGATCGTTGA CAGGTCCTAG TAGA 34

SEQ ID NO: 33
92 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
GGTCACTTAG ATCGATCCAT GAGGATGCTT CATTCTGATT CAGTCCCCAG TGAATCTAGC 60
TAGGTACTCC TACGAAGTAA GACTAAGTCA GG 92

SEQ ID NO: 34
64 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
GCATCTGGAG TCTACAGTCG TCTATTGACG CGTAGACCTC AGATGTCAGC AGATAACTGC 60
CGGC 64

SEQ ID NO: 35
36 base pairs; nucleic acid; double; unknown; DNA (genomic); unknown
GCATCAGGAT GTACAGCGTA GTCCTACATG TCGCCA 36

SEQ ID NO: 36
27 base pairs; nucleic acid; single; unknown; DNA (genomic); unknown
GTATTTTCCA GCCCAAGCCA AAGCTGG 27

SEQ ID NO: 37
21 base pairs; nucleic acid; single; unknown; DNA (genomic); unknown
GCAGTGTTGG CAAGGACAGG C 21

SEQ ID NO: 38
21 base pairs; nucleic acid; single; unknown; DNA (genomic); unknown
GCCCTTGGCA TTCTTACCTG C 21

SEQ ID NO: 39
22 base pairs; nucleic acid; single; unknown; DNA (genomic); unknown
CTGGGAGAAT GCCAGGGAAA GG 22

SEQ ID NO: 40
20 base pairs; nucleic acid; single; unknown; DNA (genomic); unknown
GTGATAGGCA CTGACCCCAG 20

SEQ ID NO: 41
22 base pairs; nucleic acid; single; unknown; DNA (genomic); unknown
CGCATGCTCC CAGCTCTTGT GC 22

What is claimed is:
 1. A method for characterizing cDNA, which comprises: (a) cutting a sample comprising a population of one or more cDNAs or isolated fragments thereof, each having a strand complementary to the 3′ poly-A terminus of an mRNA and bearing a tail, with a first sampling endonuclease at a first sampling site of known displacement from a reference site proximal to the tail to generate from each cDNA or isolated fragment thereof a first and second sub-fragment, each comprising a sticky end sequence of predetermined length and unknown sequence, the first sub-fragment bearing the tail; (b) sorting either the first or second sub-fragments into sub-populations according to their sticky end sequence and recording the sticky end sequence of each sub-population as the first sticky end; (c) cutting the sub-fragments in each sub-population with a second sampling endonuclease, which is the same as or different from the first sampling endonuclease, at a second sampling site of known displacement from the first sampling site to generate from each sub-fragment a further sub-fragment comprising a second sticky end sequence of predetermined length and unknown sequence; and (d) determining each second sticky end sequence; wherein the aggregate length of the first and second sticky end sequences of each sub-fragment is from 6 to 10; and wherein the sequences and relative positions of the reference site and first and second sticky ends are utilized to characterize the cDNA or cDNAs.
 2. A method according to claim 1, wherein the sample cut with the first sampling endonuclease comprises isolated fragments of the cDNAs produced by cutting a sample comprising a population of one or more cDNAs with a restriction endonuclease and isolating fragments whose restriction site is at the reference site.
 3. A method according to claim 2, wherein the first sampling endonuclease binds to a first recognition site and cuts at the first sampling site at a predetermined displacement from the restriction site of the restriction endonuclease.
 4. A method according to claim 3, wherein the first recognition site is provided in a first adaptor oligonucleotide which is hybridized to the restriction site of the isolated fragments.
 5. A method according to claim 2, wherein the restriction endonuclease recognizes a 4 base pair binding site.
 6. A method according to claim 2, wherein the second sub-fragments are sorted in step (b).
 7. A method according to claim 1, wherein the first sampling endonuclease binds to the reference site and cuts at the first sampling site at a predetermined displacement from the reference site.
 8. A method according to claim 1, wherein the first sampling endonuclease comprises a Type IIs endonuclease.
 9. A method according to claim 1, wherein the second sampling endonuclease binds to a second recognition site and cuts at the second sampling site at a predetermined displacement from the first sampling site.
 10. A method according to claim 9, wherein the second sampling endonuclease comprises a Type IIs endonuclease.
 11. A method according to claim 9, wherein the second recognition site is provided in a second adaptor oligonucleotide which is hybridized to the first sticky end.
 12. A method according to claim 1, wherein the tails of the cDNAs or fragments thereof are bound to a solid phase matrix.
 13. A method according to claim 1, wherein the aggregate length of the first and second sticky end sequences of each sub-fragment is 8.
 14. A method according to claim 13, wherein the length of each sticky end is 4.
 15. A method according to claim 1, wherein the step (b) of sorting the sub-fragments comprises dividing the sub-fragments into an array of samples, each sample in a separate container; contacting the array of samples with an array of solid phase affinity matrices, each solid phase affinity matrix bearing a unique base sequence of the same predetermined length as the first sticky end, so that each sample is contacted with one of the possible base sequences and the array of samples is contacted with all possible base sequences of that predetermined length for hybridization to occur only between each unique base sequence and first sticky end complementary with one another; and washing unhybridized material from the containers.
 16. A method according to claim 1, wherein the step (d) of determining each second sticky end sequence comprises isolating the further sub-fragments from step (c) and contacting the further sub-fragments with an array of adaptor oligonucleotides in a cycle, each adaptor oligonucleotide bearing a label and a unique base sequence of the same predetermined length as the second sticky end, the array containing all possible base sequences of that predetermined length; wherein the cycle comprises sequentially contacting each adaptor oligonucleotide of the array with each sub-population of isolated sub-fragments under hybridization conditions, removing any unhybridized adaptor oligonucleotide and determining the presence of any hybridized adaptor oligonucleotide by detection of the label, then repeating the cycle, until all of the adaptors in the array have been tested.
 17. A method according to claim 1, wherein the step (b) of sorting the sub-fragments comprises (i) binding the sub-fragments to a hybridization array comprising an array of oligonucleotide sets, each set bearing a unique base sequence of the same predetermined length as the first sticky end and identifiable by location in the array, all possible base sequences of that predetermined length being present in the array, so that each sub-population bearing its unique first sticky end is hybridized at an identifiable location in the array; and (ii) determining the location to identify the first sticky end sequence.
 18. A method according to claim 1, wherein the sub-fragments cut in step (c) are those bound to the hybridization array so that the further sub-fragments generated thereby remain bound to the hybridization array; and wherein the step (d) of determining each second sticky end sequence comprises contacting the further sub-fragments under hybridization conditions with an array of adaptor oligonucleotides, each adaptor oligonucleotide bearing a label and a unique base sequence of the same predetermined length as the second sticky end, the array containing all possible base sequences of that predetermined length, removing any unhybridized adaptor oligonucleotide, and determining the location of any hybridized adaptor oligonucleotide by detection of the label.
 19. A method for identifying cDNA in a sample, which comprises characterizing cDNA in accordance with a method according to any one of the preceding claims, comparing the sequences and relative positions of the reference site and first and second sticky ends obtained thereby with the sequences and relative positions of the reference site and first and second sticky ends of known cDNAs in order to identify each cDNA in the sample.
 20. A method for assaying for one or more specific cDNAs in a sample, which comprises performing a method according to claim 1, wherein the reference site is predetermined, each first sticky end sequence in sorting step (b) is a predetermined first sticky end sequence, each second sticky sequence in step (d) is determined by assaying for a predetermined second sticky end sequence, and the relative positions of the reference site and predetermined first and second sticky ends characterize the or each specific cDNA.
 21. A method according to claim 20, wherein the reference site and first and second sticky end sequences are predetermined by selecting corresponding sequences from one or more known target cDNAs. 